
moffkalast 2 hours ago | on: MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

42B active params, sliding window attention. There's your tradeoff.

The sliding window is for the draft model, not the main one. It's 42B active params because it's a sparse MoE, which is a common technique for larger models to avoid getting bottlenecked by memory bandwidth.
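A back-of-envelope sketch of why the active parameter count matters for decode speed, assuming single-stream decoding is limited by reading weights from memory once per token. The 8-bit weight size and the 8 TB/s aggregate bandwidth are illustrative assumptions, not figures from the model release; only the 1T total and 42B active counts come from this thread. It ignores KV cache reads, batching, and speculative decoding with a draft model.

```python
def decode_tokens_per_second(params_read_per_token: float,
                             bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Rough upper bound on decode speed when weight reads dominate."""
    bytes_per_token = params_read_per_token * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


TOTAL_PARAMS = 1e12     # "1T" from the story title
ACTIVE_PARAMS = 42e9    # "42B active" from the thread
BYTES_PER_PARAM = 1.0   # assumption: 8-bit weights
BANDWIDTH = 8e12        # assumption: 8 TB/s aggregate memory bandwidth

# A hypothetical dense 1T model would read every weight for every token.
dense_bound = decode_tokens_per_second(TOTAL_PARAMS, BYTES_PER_PARAM, BANDWIDTH)

# A sparse MoE only reads the experts it routes to (the active params).
sparse_bound = decode_tokens_per_second(ACTIVE_PARAMS, BYTES_PER_PARAM, BANDWIDTH)

print(f"dense bound:  {dense_bound:.1f} tok/s")
print(f"sparse bound: {sparse_bound:.1f} tok/s")
print(f"speedup:      {sparse_bound / dense_bound:.1f}x")
```

Under these assumptions the bound scales with total/active params, about 24x here. Getting from that bound to the headline 1000 tokens per second would depend on things this sketch leaves out, such as the draft model.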


bearjaws 2 hours ago | parent

Given how "smart" some of the 26B dense models are now, I would not be surprised to see a strong 40B MoE.
