Story Detail of id 47477552 | Liveview Hacker News

tarruda7 hours ago | on: Flash-MoE: Running a 397B Parameter Model on a Laptop

Note that this is not the only way to run Qwen 3.5 397B on consumer devices, there are excellent ~2.5 BPW quants available that make it viable for 128G devices.

I've had great success (~20 t/s) running it on a M1 Ultra with room for 256k context. Here are some lm-evaluation-harness results I ran against it:

    mmlu: 87.86%

    gpqa diamond: 82.32%

    gsm8k: 86.43%

    ifeval: 75.90%

More details of my experience:

- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...

- https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...

Overall an excellent model to have for offline inference.

Aurornis7 hours ago | parent | next

The method in this link is already using a 2-bit quant. They also reduced the number of experts per token from 10 to 4 which is another layer of quality degradation.

In my experience the 2-bit quants can produce output to short prompts that makes sense but they aren’t useful for doing work with longer sessions.

This project couldn’t even get useful JSON out of the model because it can’t produce the right token for quotes:

> *2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable.

loading story #47478015

loading story #47478647

arjie5 hours ago | parent | next

What's the tok/s you get these days? Does it actually work well when you use more of that context?

By the way, it's been a long time since I last saw your username. You're the guy who launched Neovim! Boy what a success. Definitely the Kickstarter/Bountysource I've been a tiny part of that had the best outcome. I use it every day.

loading story #47479299

outlog5 hours ago | parent | next

What is power usage? maybe https://www.coconut-flavour.com/coconutbattery/ can tell you estimate?

loading story #47478898

loading story #47479637

loading story #47479590

#visit	13,229,580
#session	74,665
#live-session	0