Hacker News
Where are you seeing dense? Most of the larger competitive models are sparse. Sure, the smaller models are dense, but over 30B it's pretty much all sparse MoE.
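The point about sparse MoE is that only a few experts run per token, so active parameters stay far below total parameters. Here's a toy sketch of top-k routing, assuming nothing about any particular model; all shapes and names are illustrative:

```python
import numpy as np

def moe_layer(x, experts_w, router_w, k=2):
    """Toy top-k MoE layer: route one token to k of E experts.

    x: (d,) token vector; experts_w: (E, d, d); router_w: (E, d).
    Shapes are illustrative, not any real model's.
    """
    logits = router_w @ x                  # (E,) router scores
    top = np.argsort(logits)[-k:]          # indices of the k best experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                   # softmax over the chosen k only
    # Only k expert matmuls execute, so "active" params << total params.
    return sum(g * (experts_w[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, E, k = 8, 16, 2
x = rng.standard_normal(d)
out = moe_layer(x, rng.standard_normal((E, d, d)), rng.standard_normal((E, d)), k)
print(out.shape)   # (8,)
print(k / E)       # 0.125 — fraction of expert params active per token
```

This is why a 1T-total-parameter sparse model can cost roughly as much per token as a much smaller dense one.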

And there are still plenty of hybrid architectures. Nemotron 3 Super 120B A12B just came out; it's mostly Mamba with a few attention layers, and it's pretty competitive for its size class.
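The "mostly Mamba with a few attention layers" pattern is just a layer stack where attention blocks are interleaved sparsely. A minimal sketch, with a made-up 1-in-6 ratio that is not Nemotron's actual layout:

```python
# Hybrid stack sketch: mostly SSM (Mamba-style) blocks, with attention
# blocks interleaved sparsely. The 24-layer depth and 1-in-6 attention
# ratio are illustrative assumptions, not any published model's config.
def build_stack(n_layers=24, attn_every=6):
    return ["attn" if (i + 1) % attn_every == 0 else "ssm"
            for i in range(n_layers)]

stack = build_stack()
print(stack.count("ssm"), stack.count("attn"))  # 20 4
```

The handful of attention layers give the model precise token-to-token lookup, while the SSM layers keep the bulk of the stack cheap at long context.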

But yeah, these different architectures seem to be relatively small micro-optimizations: better performance on particular hardware, or different tradeoffs in how compute scales with the context window. Most of the actual differentiation seems to be in the training pipeline.
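The context-window tradeoff mentioned above comes down to asymptotics: self-attention compute grows quadratically in sequence length, while an SSM layer grows linearly. A rough sketch, ignoring all constant factors:

```python
# Rough per-sequence compute as a function of context length n:
# self-attention is O(n^2), a state-space (Mamba-style) layer is O(n).
# Constants and hardware effects are ignored; this only shows the shape
# of the tradeoff, not real FLOP counts.
def attn_cost(n): return n * n
def ssm_cost(n):  return n

for n in (1_000, 10_000, 100_000):
    print(n, attn_cost(n) // ssm_cost(n))  # ratio grows linearly with n
```

At short contexts the difference is negligible, which is why hybrids can afford a few full-attention layers; at 100k+ tokens the gap dominates.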

We're seeing substantial increases in performance without continuing to scale up: open models have hit 1T parameters, but smaller models keep outperforming them thanks to better and better training pipelines.