I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.
If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.
I'm curious if someone here with a stronger background in the space has a similar intuition or not.
There is a real trend of smaller models becoming more "capability-dense" - i.e. the best 8Bs of today beat the best 32Bs of 2 years ago. This is in part a product of distillation being used to train the smaller models.
But people consistently underestimate how "capability hungry" the world is. There are diminishing returns on model capabilities in narrow "summarize the search results" sorts of applications - but as capabilities improve, LLMs enter new niches, find their footing and begin to dominate them. At times, these are expensive, highly desirable niches.
I do not expect anyone at the frontier to pop up and say "no reason to train a new model" within the next decade. There will always be demand for an LLM that's 5-10% more capable and more reliable at some highly advanced task, and generational upgrades will keep delivering those 5-10%, from both increased scale and improved training.
But for some classes of problems I think a model that is 10-100x smarter than the smartest expert is a huge boon. These would be problems that are very hard to solve but easy to verify that the solution is correct. Protein folding, sudoku, etc. Because of this I see the really smart models going to biomedical and pharma first and maybe a few high profit verticals rather than being widely deployed. I am sure Pfizer would be happy to pay for a 100x smarter than the smartest researcher model. But I am not certain that this kind of market fit would justify trillion dollar valuations in the long run. And in the meantime normal “human companion” models will go from Sonnet to some open weight model running on a Dell tower in your closet to maybe even on your phone in the next few years.
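To make the "hard to solve, easy to verify" point concrete, here is a minimal sketch of a Sudoku checker in Python. The function name `is_valid_solution` and the test grid are made up for illustration, and it only checks that a completed 9x9 grid is internally valid, not that it matches a particular puzzle's given clues. The check is a few set comparisons, while finding a solution in general takes search.

```python
def is_valid_solution(grid):
    """Return True if grid is a correctly completed 9x9 Sudoku (hypothetical helper)."""
    # Reject anything that is not exactly 9 rows of 9 cells.
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r + dr][c + dc] for dr in range(3) for dc in range(3)}
        for r in range(0, 9, 3)
        for c in range(0, 9, 3)
    ]
    # Every row, column and 3x3 box must contain each digit 1-9 exactly once.
    return all(group == digits for group in rows + cols + boxes)


# Build a known-valid grid from a standard shifting pattern, then break it.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
broken = [row[:] for row in solved]
broken[0][0], broken[0][1] = broken[0][1], broken[0][0]  # row 0 stays fine, columns 0 and 1 do not

print(is_valid_solution(solved))  # True
print(is_valid_solution(broken))  # False
```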
The relationship should be the opposite: the smartest people can write the most readable solutions.
Of course perhaps at that point I really do become more of a spec and prompt engineer and don’t actually look at the code any more than I look at the assembly code produced from my programs now. But still my gut says using hyperintelligence to do common tasks is all positive.
The latter is much better (since you can clean up, review, update responses and filter your datasets).
I suspect nobody is doing real student-teacher distillation; it’s just easier to do a bunch of training on the same giant corpus, then post-train on a synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger, better LLM).
Given the release timelines, I suspect all 4.x models after Opus 4 are probably fine-tuned via self-distillation. The latest paper by Apple focuses on code generation using a simple technique, hence the name simple self-distillation (SSD) [4],[5].
I've got a strong feeling that self-distillation is the second-best thing to happen to LLMs after the transformer breakthrough.
[1] Self-Distillation Enables Continual Learning [pdf] (25 comments):
https://news.ycombinator.com/item?id=48165265
[2] Self-Distillation Enables Continual Learning:
https://arxiv.org/abs/2601.19897
[3] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:
https://arxiv.org/abs/2601.18734
[4] Embarrassingly simple self-distillation improves code generation (201 comments):
https://news.ycombinator.com/item?id=47637757
[5] Embarrassingly Simple Self-Distillation Improves Code Generation:
Having said that, I don't think these are classic student-teacher distillation from random weights (which was my point). In fact, the "Embarrassingly Simple Self-Distillation" paper is using exactly what I was talking about: "fine-tune on those samples with standard supervised fine-tuning".
Though you could argue that perhaps labs just save the per-token distribution and use that during fine-tuning … which starts looking more like student-teacher fine-tuning, if not classic distillation from random weights.
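For anyone who wants the difference spelled out, here is a rough sketch of the two per-token losses being contrasted: plain supervised fine-tuning on a token the teacher sampled (a hard label) versus matching the teacher's saved per-token distribution (a soft label, via KL divergence). The function names and the 3-token vocabulary are made up for illustration; this is not any lab's actual pipeline.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_loss(student_logits, sampled_token):
    """Cross-entropy against one token sampled from the teacher (hard label)."""
    probs = softmax(student_logits)
    return -math.log(probs[sampled_token])


def distillation_loss(student_logits, teacher_probs):
    """KL(teacher || student) against the teacher's full distribution (soft label)."""
    probs = softmax(student_logits)
    return sum(t * math.log(t / s) for t, s in zip(teacher_probs, probs) if t > 0)


# Toy 3-token vocabulary (assumed values, for illustration only).
teacher_probs = [0.7, 0.2, 0.1]
untrained_student = [0.0, 0.0, 0.0]                  # uniform over the vocabulary
matched_student = [math.log(p) for p in teacher_probs]

print(sft_loss(untrained_student, sampled_token=0))
print(distillation_loss(untrained_student, teacher_probs))
print(distillation_loss(matched_student, teacher_probs))  # ~0 once the student matches
```

The hard-label loss only sees the single token that happened to be sampled, while the soft-label loss also tells the student how much probability the teacher put on every other token.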
Here the teacher is a corpus of text, and predicting the "next token after the context" means looking up the context in the corpus: for each occurrence, the label is whatever followed it, weighted by one over the number of occurrences of the context. The teacher is moot on contexts outside of the corpus though, unlike the usual teacher model in distillation.
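A minimal sketch of that "corpus as teacher" view, assuming whitespace-split tokens and a fixed context length (both are simplifications of mine, and the function names are hypothetical): the soft label for a context is just the follower counts divided by how often the context occurs, and there is no answer at all for unseen contexts.

```python
from collections import Counter, defaultdict


def build_corpus_teacher(tokens, context_len):
    """Map each context (a tuple of tokens) to a Counter of the tokens that followed it."""
    followers = defaultdict(Counter)
    for i in range(len(tokens) - context_len):
        context = tuple(tokens[i:i + context_len])
        followers[context][tokens[i + context_len]] += 1
    return followers


def teacher_distribution(followers, context):
    """Follower counts scaled by the number of occurrences of the context.

    Returns None when the context never appears in the corpus: the corpus
    "teacher" has nothing to say there, unlike a teacher model.
    """
    counts = followers.get(tuple(context))
    if not counts:
        return None
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}


corpus = "the cat sat the cat ran the dog sat".split()
teacher = build_corpus_teacher(corpus, context_len=2)

print(teacher_distribution(teacher, ["the", "cat"]))  # {'sat': 0.5, 'ran': 0.5}
print(teacher_distribution(teacher, ["dog", "ran"]))  # None
```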
It gets used for quantisation, basically recovering accuracy for lower quants (Nvidia calls it QAD). Can’t speak to how widespread it is though
A lot, so you can bet tens of millions are flowing to Congress to have distillation declared illegal before this happens. And then it'll happen anyway.
A lab can train a large model, and then distill a smaller model from it that retains the majority of the useful capability.
I don't know enough to say whether there's any benefit of that over just training the smaller model directly, but I'll bet there are some times where that is useful. I could easily see it being easier to do the initial pre-training on a larger model but be able to distill everything useful down into a smaller model, essentially filtering out a lot of noise in the process.
You don't need distillation. They already have the training sets.
It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).
And even that would be rich as an accusation from SOTAs that depend on explicitly disregarding the intellectual property in millions of pieces of training data.
LLMs are themselves copycats.
I say thanks for open sourcing and thereby promoting affordable innovation, instead of "nefarious". :)
On the architectures side, I'm a lot more interested in attention residuals than anything else; it's one of those things that seems obvious in hindsight, and Kimi have proven it at scale.
Yes, variants typically 2-3x less good...
Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just weren't known when they started development of the previous models.
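For reference, the baseline idea behind speculative decoding can be sketched in a few lines. This is the simplified greedy variant, not Medusa and not the sampling-based acceptance rule, and the two "models" are toy functions I made up: a cheap draft model proposes a few tokens, the target model checks them, and the output is identical to decoding with the target alone.

```python
def greedy_decode(next_token, prompt, num_new):
    """Plain greedy decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(next_token(out))
    return out


def speculative_decode(target_next, draft_next, prompt, num_new, k=4):
    """Greedy speculative decoding sketch (hypothetical helper names)."""
    out = list(prompt)
    goal = len(prompt) + num_new
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        proposal = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft_next(out + proposal))
        # 2. The target model checks each proposed position. A real system scores
        #    all k positions in one batched forward pass, which is where the
        #    speedup comes from; this toy loop calls the target once per token.
        for token in proposal:
            expected = target_next(out)
            out.append(expected)       # always keep the target's own choice
            if token != expected:      # first mismatch: discard the rest of the draft
                break
    return out


# Toy stand-ins for models (assumptions for illustration only).
def target_next(seq):
    return (seq[-1] * 2 + 1) % 7


def draft_next(seq):
    return 5 if seq[-1] == 3 else target_next(seq)  # disagrees with the target after a 3


print(speculative_decode(target_next, draft_next, [1], 10))
print(greedy_decode(target_next, [1], 10))  # same sequence as the line above
```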