> The 1T+ models will be able to sideload and improve upon a lot of the fundamental improvements that happen to the smaller models to speed up incredibly while getting better at tools while (on a gradient) getting -more- things right.
There's less room to improve on several fronts.
GRAM very likely scales sub-linearly with parameter count. A 100M-param model may gain reasoning by a factor of 4000, while a 100B model gains by a factor of 2, and a 1T model actually gets worse.
Additionally, the 1T model with reasoning is already pretty good. There's only so much it can improve at any given thing.
If you score 0.02% on a metric (which small models often do), you can pretty easily get 4000x better (that only takes you to 80%). If you're already scoring >50%, you can't even get 2x better, because that would put you past 100%.
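To make the ceiling arithmetic concrete, here's a quick sketch. It assumes the metric is a percentage capped at 100%, and `max_multiplier` is just a hypothetical helper name for illustration, not anything from a real benchmark suite.

```python
def max_multiplier(score_pct: float) -> float:
    """Largest factor a score can grow by before hitting the 100% cap.

    Assumes the metric is a percentage in (0, 100].
    """
    if not 0 < score_pct <= 100:
        raise ValueError("score_pct must be in (0, 100]")
    return 100.0 / score_pct


if __name__ == "__main__":
    # Scores taken from the comment above: a tiny model vs. an already-good one.
    for score in (0.02, 50.0):
        print(f"{score}% -> at most {max_multiplier(score):g}x improvement")
```

The lower the starting score, the bigger the multiplier you can post, so "Nx better" numbers say more about where a model started than about how much it actually gained.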