42B active params, sliding window attention. There's your tradeoff.
Sliding window attention is for the draft model, not the main one. The 42B figure is active params because it's a sparse MoE, a common technique that keeps larger models from being bottlenecked by memory bandwidth.
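To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch. It assumes decode is purely memory-bandwidth bound, so each generated token streams every active weight once. Only the 42B active figure comes from the thread; the 400B total, the 1 TB/s bandwidth, and the 8-bit weights are made-up illustrative values, and it ignores KV cache reads and batching.

```python
def decode_tokens_per_second(params_read_per_token: float,
                             bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Rough upper bound on decode speed when weight reads dominate.

    Assumes every parameter touched for a token is read from memory once.
    """
    return bandwidth_bytes_per_s / (params_read_per_token * bytes_per_param)


# All values below except ACTIVE_PARAMS are hypothetical, for illustration only.
ACTIVE_PARAMS = 42e9          # from the thread: 42B active params per token
HYPOTHETICAL_TOTAL = 400e9    # assumed total size; not stated anywhere in the thread
BYTES_PER_PARAM = 1.0         # assumed 8-bit weights
BANDWIDTH = 1e12              # assumed 1 TB/s of memory bandwidth

sparse = decode_tokens_per_second(ACTIVE_PARAMS, BYTES_PER_PARAM, BANDWIDTH)
dense = decode_tokens_per_second(HYPOTHETICAL_TOTAL, BYTES_PER_PARAM, BANDWIDTH)

print(f"MoE, 42B active:        ~{sparse:.1f} tok/s upper bound")
print(f"dense at same total:    ~{dense:.1f} tok/s upper bound")
```

Under those assumptions the ceiling scales with active params, not total params, which is the whole appeal. You still need enough memory to hold the full model.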
Given how "smart" some of the 26B dense models are now, I would not be surprised to see a strong 40B MoE.