Story Detail of id 48312753 | Liveview Hacker News

minimaltom 22 hours ago | on: Claude Opus 4.8

Frontier labs have their own variants of MLA and certainly their own balance/scaling-laws for things like MoE vs FC vs Attn. MoE scales really well for inference with horizontal scaling + batching, which these guys luv.

On the architectures side, I'm a lot more interested in attention residuals than anything else. It's one of those things that seems obvious in hindsight, and Kimi have proven it at scale.

onlyrealcuzzo 22 hours ago | parent

> Frontier labs have their own variants of MLA

Yes, variants typically 2-3x less good...

Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just weren't known when they started development of the previous models.

amluto 21 hours ago | root | parent

How useful is speculative decoding in a batched setting where you get paid for throughput (aggregated across users) and you mostly don’t get paid for latency or single-session throughput?

onlyrealcuzzo 21 hours ago | root | parent

It's useful at the local level, where there will be SOTA models developed...
