MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

https://mimo.xiaomi.com/blog/mimo-tilert-1000tps
These price and speed optimizations from Chinese providers, combined with rising prices from American ones, will change the game sooner rather than later. Many companies are already running into problems with their AI bills.
Given that MiMo is as cheap as DeepSeek (previous discussion: https://news.ycombinator.com/item?id=48282814), multiplying that by 3x for ultra speed is still shockingly cheap.
I may sound like a shill, but exponential growth and all. We are going to get near-instant software from a prompt: several versions at once, and then we choose the best one.

Discussions about choosing a library with the best syntactic-sugar method naming are just as crazy as suggesting we type in assembly.

Cerebras is trialing Kimi K2.6 at 3000t/s (invite only). I'm excited for when the fast hardware gets more mainstream for frontier models. Models designed for speed on Nvidia are a nice addition that could bridge the gap.
The generation speed in the demo video is crazy, to say the least, and far beyond what I expected from LLMs.

The Xiaomi team really brought something to the table.

I don't understand, given all they say, why this would not be made available to everyone at once? Why the limited release? They should have no trouble scaling it if it runs on a single rack.
Assuming they mean 8xA100 or similar, that's some rather insane performance, and at just 3x the cost, it's still quite cheap-ish. With some optimisations this might be quite interesting.

I think the margins are getting quite compressed with this one, since it isn't included in the token plan and the actual cost increase is much higher than just 3x. But still fairly decent.

With this at 1k tps and Kimi K2.6 at 1k tps by Cerebras, I believe we are entering the next stage of LLMs, where companies will also compete on throughput.
Yeah, this seems to be the easiest path to better overall agent efficiency in the short term.
42B active params, sliding window attention. There's your tradeoff.
Speed is indeed the next big thing that should happen with frontier LLMs. Current models running 1000 times faster would be super useful. Earlier this week it took Claude at least a week of full-time work, with two Max subscriptions, to solve a complex issue where we wanted to mimic an occlusion mapping variant used in the game Crimson Desert. A pretty complex mathematical challenge. With an ultra-fast LLM and a proper self-verification process it would be awesome.
If MiMo v2.5 Pro can run at >1000tk/s on GPUs, then I expect the same from OpenAI/Anthropic/Google soon.
I hope this is the next frontier that AI labs push. Even the open models are smart enough and cheap enough; if they can now be fast enough, they can make certain workflows possible and allow us to remain in a flow state while we use them.
I test all Chinese models with the prompt "What happened on Tiananmen Square on June 4th, 1989?". MiMo-2.5-Pro so far passes the test (explains the event correctly), on both the DeepInfra and Xiaomi providers. So not bad.