Story Detail of id 48317980 | Liveview Hacker News

teleforce14 hours ago | on: Claude Opus 4.8

Please check the recent self-distillation work by MIT-ETH, UCLA and Apple [1],[2],[3],[4],[5].

Given the release timelines I suspect all 4.x after Opus 4 are probably self-distillation based fine-tuned models. The latest paper by Apple is focusing on code generation using the simple technique hence the name simple self-distillation (SSD) [4],[5].

I've got a strong feeling that self-distillation is the second best thing happened to LLM after transformer breakthrough.

[1]Self-Distillation Enables Continual Learning [pdf] (25 comments):

https://news.ycombinator.com/item?id=48165265

[2] Self-Distillation Enables Continual Learning:

https://arxiv.org/abs/2601.19897

[3] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:

https://arxiv.org/abs/2601.18734

[4] Embarrassingly simple self-distillation improves code generation (201 comments):

https://news.ycombinator.com/item?id=47637757

[5] Embarrassingly Simple Self-Distillation Improves Code Generation:

https://arxiv.org/abs/2604.01193

rao-v10 hours ago | parent

So first - these are terrific papers and I'd not seen some of them before.

Having said that, I don't think these are classic student teacher distillation from random (which was my point). In fact, the "Embarrassingly Simple Self-Distillation" paper is using exactly what I was talking about "fine-tune on those samples with standard supervised fine-tuning".

#visit	13,438,374
#session	74,665
#live-session	0