I'm surprised at how similar all of them are, with the main differences being the sizes of the layers.
Most of the arch work is just turning scaling knobs.
If you swap in weird layer types or change the objective much, people run into ugly failure modes fast. So the field keeps circling the same Transformer blocks, then markets the change as novel when it's mostly a training and compute tradeoff.
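
To make the "scaling knobs" point concrete, here's a rough sketch of how a handful of size settings determine the parameter count of a plain Transformer stack. The `BlockConfig` name, its fields, and both example configs are made up for illustration and not taken from any particular model or library. The count ignores biases, layer norms, and positional embeddings, and assumes the output layer shares weights with the token embedding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockConfig:
    """Hypothetical config: the only things that vary are sizes."""
    n_layers: int    # number of stacked Transformer blocks
    d_model: int     # hidden width
    d_ff: int        # feed-forward inner width
    vocab_size: int  # token vocabulary size


def approx_params(cfg: BlockConfig) -> int:
    """Rough weight count, ignoring biases, layer norms, positions."""
    attn = 4 * cfg.d_model * cfg.d_model  # Q, K, V and output projections
    mlp = 2 * cfg.d_model * cfg.d_ff      # up and down projections
    embed = cfg.vocab_size * cfg.d_model  # token embedding (assumed tied)
    return cfg.n_layers * (attn + mlp) + embed


# Two illustrative configs: same block design, different knob settings.
small = BlockConfig(n_layers=12, d_model=768, d_ff=3072, vocab_size=50_000)
large = BlockConfig(n_layers=48, d_model=1600, d_ff=6400, vocab_size=50_000)

for name, cfg in [("small", small), ("large", large)]:
    print(f"{name}: ~{approx_params(cfg) / 1e6:.0f}M parameters")
```

Same block, same formula; the order-of-magnitude gap between the two comes entirely from depth and width.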