Thanks for the note about Qwen3.5. I should keep up with this more. If only it were more relevant to my day-to-day work with LLMs!
I did consider MoEs but decided (pretty arbitrarily) that I wasn’t going to count them as a truly fundamental change. But I agree, they’re pretty important. There’s also RoPE, perhaps slightly less of a big deal but still a big difference from the earlier models. And of course there are lots of brilliant inference tricks, like speculative decoding, that have helped make big models more usable.