Flash-MoE: Running a 397B Parameter Model on a Laptop
https://github.com/danveloper/flash-moeI've had great success (~20 t/s) running it on a M1 Ultra with room for 256k context. Here are some lm-evaluation-harness results I ran against it:
mmlu: 87.86%
gpqa diamond: 82.32%
gsm8k: 86.43%
ifeval: 75.90%
More details of my experience:- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...
- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...
- https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...
Overall an excellent model to have for offline inference.
This is some interesting work, but applying such extreme measures to LLMs to get them to run severely degrades quality. I know he claims negligible quality loss, but in my experience 2-bit quantizations are completely useless for real work. You can get them to respond to prompts, but they lose their intelligence and will go around in circles.
He also shows 5-6 tokens per second. Again that’s impressive for a large model on limited hardware but it’s very slow. Between the severely degraded model abilities and the extremely slow output the 397B result should be considered an attempt at proving something can technically run, not evidence that it can run well and produce output you’d expect from a 397B model.
He even mentions the obvious problems with his changes:
> *2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable.
So right out of the gate this isn’t useful if you want to do anything with it. He could have tried smaller models or less quantizations to get actual useful output from the model, but it wouldn’t look as impressive. It’s honestly getting kind of exhausting to read all of these AI-coded (admitted in the link) and AI-written papers made more for resume building. It would have been interesting to see this work applied to running a useful model that hadn’t been lobotomized instead of applying tricks to get an impressive headline but useless output.
Even with a MoE model, which has to move a relatively small portion of the weights around, you do end up quite bandwidth constrained though.
It’s workable for mixture of experts models but the performance falls off a cliff as soon as the model overflows out of the GPU and into system RAM. There is another performance cliff when the model has to be fetched from disk on every pass.
Outside of that the SSD is idling.
Table 3 shows for K=4 experts an IO of 943 MB/Tok at 3.15 Tok/s giving an average IO of 2970 MB/s far below what the SSD could do.
I'm not sure, but not all expert weights are used immediately. Maybe they could do async reads for the down tensors parallelizing compute with IO.
Not sure if this works on Mac, I only tested my larger than RAM setup on Linux with io_uring O_DIRECT reads and I saw that about 20% of total reads do finish while my fused upgate matmul is already running.
Edit: Typos
https://github.com/matt-k-wong/mlx-flash
2 bit quantization lobotomizes the model but is impressive nonetheless! Maybe one day we'll be able to have intelligent 2 bit quants... I wonder.
my version supports - 4bit quantization, hybrid streaming (Disk + ram), arbitrary model compatibility, tested on Mamba2, and lets up the framework for LM Studio integration
I leveraged this work (Credit to Danveloper) and am in the middle of making this work on more practical models and quants. It still uses flash streaming, but done so with a control knob so you can choose how much ram and how little ram to use. In the craziest case, it uses as little ram as possible but is very slow, however, in the balanced case you use some ram and it's much faster.
I designed it around the intelligence dense Nemotron 3 Nano 30B and Nemotron Cascade 2 30B models (which are smaller, more intelligence density) and can run on low end 16GB machines, though you can run arbitrarily large models on larger machines (designed for very low end, but capable of high end).
MacBook Pro M5 Pro (64GB RAM)
>...at 4.4+ tokens/second
That is even when it is using 4-bit quantization and it is still at that speed.
> The entire 209GB model streams from SSD through a custom Metal compute pipeline.
This is my main problem.
If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.
Can't imagine using this in the long term right now, but improvements will follow. Still a great write up anyways.
How sure are you about that? I've never looked closer at how a large LLM with mixture of experts architecture switches between expert modules, but staying on roughly the same topic for the use (as it often would when editing the same codebase), I wouldn't be surprised to see the switches of composition are fairly rare, fairly small, and to the extent it happens it's repeated reads from the flash disk rather than writes it tends to cause.
I feel like whenever I'm trying to find information on which local models will work on my hardware, I have to overestimate because people don't know how to wait for things.
Also, reading data doesn't cause SSD wear.