Hacker News new | past | comments | ask | show | jobs | submit
This is great - always worth reading anything from Sebastian. I would also highly recommend his Build an LLM From Scratch book. I feel like I didn’t really understand the transformer mechanism until I worked through that book.

On the LLM Architecture Gallery, it’s interesting to see the variations between models, but I think the 30,000ft view of this is that in the last seven years since GPT-2 there have been a lot of improvements to LLM architecture but no fundamental innovations in that area. The best open weight models today still look a lot like GPT-2 if you zoom out: it’s a bunch of attention layers and feed forward layers stacked up.

Another way of putting this is that astonishing improvements in capabilities of LLMs that we’ve seen over the last 7 years have come mostly from scaling up and, critically, from new training methods like RLVR, which is responsible for coding agents going from barely working to amazing in the last year.

That’s not to say that architectures aren’t interesting or important or that the improvements aren’t useful, but it is a little bit of a surprise, even though it shouldn’t be at this point because it’s probably just a version of the Bitter Lesson.

> On the LLM Architecture Gallery, it’s interesting to see the variations between models, but I think the 30,000ft view of this is that in the last seven years since GPT-2 there have been a lot of improvements to LLM architecture but no fundamental innovations in that area.

After years of showing up in papers and toy models, hybrid architectures like Qwen3.5 contain one such fundamental innovation - linear attention variants which replace the core of transformer, the self-attention mechanism. In Qwen3.5 in particular only one of every four layers is a self-attention layer.

MoEs are another fundamental innovation - also from a Google paper.

Thanks for the note about Qwen3.5. I should keep up with this more. If only it were more relevant to my day to day work with LLMs!

I did consider MoEs but decided (pretty arbitrarily) that I wasn’t going to count them as a truly fundamental change. But I agree, they’re pretty important. There’s also RoPE too, perhaps slightly less of a big deal but still a big difference from the earlier models. And of course lots of brilliant inference tricks like speculative decoding that have helped make big models more usable.

I'd push back slightly on the "no fundamental innovations" read though — the innovations that stuck (MoE, GQA, RoPE) are almost entirely ones that improve GPU utilization: better KV-cache efficiency, more parallelism in attention, cheaper to serve per parameter. Mamba and SSM-based hybrids are interesting but kept running into hardwar friction.