Story Detail of id 48312485 | Liveview Hacker News

onlyrealcuzzo 22 hours ago | on: Claude Opus 4.8

> I don't disagree, but how much of this ends up being distillation?

You don't need distillation. They already have the training sets.

It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).

Philpax 22 hours ago | parent | next

It wouldn't be data distillation: instead, it would be teacher-student distillation. The teacher model has stronger representations that the student can mimic, which would give it more capability over training on the data itself.
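To make the teacher-student idea concrete, here is a minimal sketch, assuming the classic soft-target setup: the student is trained to match the teacher's temperature-softened output distribution rather than a single hard label. The function names, the toy logits, and the temperature value are all illustrative assumptions, not anything a particular lab is confirmed to use.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-target loss: the student mimics the teacher's full distribution.

    The temperature**2 factor is the usual rescaling that keeps gradient
    magnitudes comparable across temperatures (assumed here, not required).
    """
    p_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, q_student)

def hard_label_loss(student_logits, label):
    """Plain cross-entropy against one correct class, i.e. training on raw data."""
    return -math.log(softmax(student_logits)[label])

# Toy example with three classes (made-up numbers).
teacher = [4.0, 3.5, -2.0]   # teacher thinks classes 0 and 1 are both plausible
student = [1.0, 0.0, 0.5]

soft = distillation_loss(teacher, student)
hard = hard_label_loss(student, label=0)
```

The hard label only says "class 0"; the teacher's distribution also says class 1 is nearly as plausible and class 2 is not, which is the extra signal the student gets from mimicking the teacher.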

semiquaver 21 hours ago | parent | next

The frontier labs distill their own base models all day long. It’s not just something done by nefarious Chinese copycats. The knowledge embodied by the internal base models that we never see is much more powerful and useful than the much sparser raw training data.

coldtea 21 hours ago | root | parent | next

>It’s not just something done by nefarious Chinese copycats

And even that would be rich as an accusation, coming from SOTA labs that depend on explicitly disregarding the intellectual property in millions of pieces of training data.

flossly 16 hours ago | root | parent | next

> nefarious Chinese copycats

LLMs are themselves copycats.

I say thanks for open sourcing and thereby promoting affordable innovation, instead of "nefarious". :)

manmal 21 hours ago | root | parent | next

But how? The training data is the unadulterated content those models are based on? I genuinely don’t understand, no snark.


supern0va 21 hours ago | root | parent

I think you replied to the wrong parent.

minimaltom 22 hours ago | parent

Frontier labs have their own variants of MLA and certainly their own balance/scaling-laws for things like MoE vs FC vs Attn. MoE scales really well for inference with horizontal scaling + batching, which these guys luv.

On the architecture side, I'm a lot more interested in attention residuals than anything else. It's one of those things that seems obvious in hindsight, and Kimi have proven it at scale.

onlyrealcuzzo 22 hours ago | root | parent

> Frontier labs have their own variants of MLA

Yes, variants typically 2-3x less good...

Same with speculative decoding... They all do something, but there are known techniques that are substantially better, ones that just weren't known when they started development of the previous models.
