Why does everyone expect interactivity from local AI? It's not the best use of the hardware, especially not miniPC hardware. Long-term batched inference with larger and more capable models is much more feasible AIUI.
I can't speak for others, but IMO the only reason to run models locally right now is privacy, i.e. you don't trust any of the cloud providers not to look at your prompts. Price-wise the market is extremely competitive, and cheap model serving favors large scale, so anything that can be run locally can be run cheaper in the cloud. But if privacy is important, then it's important for everything, including traditional chatbot applications, which kinda do require interactivity.
Even batched, it's uncomfortably slow. I started to benchmark ds4 with my security vulnerability benchmark (after Qwen 3.6 dense and MoE and a bunch of cloud models), but it was going to tie up the Strix Halo for more than a day, so I decided not to run it, since I couldn't have used the machine for anything else during that time.
Even batched usage needs to be fast enough to deliver results in a reasonable time. Overnight runs are useful, 24 hour runs are...less so.
Anyway, most of the time people are talking about interactive use, and there's currently an upper bound on how large a model can be for local hosting on a reasonable budget (i.e. not wildly more expensive than a high-end developer desktop or laptop). The sweet spot right now is probably the big Qwen 3.6 or Gemma 4 models, which land in the ~60GB range at 8-bit quantization with a large context.
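For anyone who wants to sanity-check a memory budget like that, here is a rough back-of-the-envelope sketch. The layer count, KV head count, head dimension, and context length below are made-up illustrative values for a plain dense-attention model, not the real config of any Qwen or Gemma release, and the estimate ignores quantization scale overhead, activations, and runtime buffers.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bits_per_value: float) -> float:
    """Approximate KV cache memory in GB for a single sequence.

    The factor of 2 covers keys plus values.
    """
    values = 2 * layers * kv_heads * head_dim * context_tokens
    return values * bits_per_value / 8 / 1e9


# Hypothetical model shape, chosen only for illustration.
PARAMS_B = 35
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
CONTEXT = 131072

for weight_bits, kv_bits in [(8, 16), (6, 8)]:
    total = (weights_gb(PARAMS_B, weight_bits)
             + kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, kv_bits))
    print(f"{weight_bits}-bit weights, {kv_bits}-bit KV: ~{total:.1f} GB")
```

Swap in the numbers from the model's actual config file before trusting the result; architectures with sliding-window or hybrid attention will need far less KV cache than this simple formula suggests.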
The 6-bit versions with an 8-bit KV cache seem to save a good bit of memory without a significant loss of quality. The Qwen 35B is pretty fast in my testing, but MiniMax M2.7 230B is in some ways faster (it needs way fewer tokens to arrive at an answer) even though it is much larger.
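To make the "fewer tokens can beat faster tokens" point concrete, here is a tiny sketch of time-to-answer. All the throughput and token counts are invented placeholders, not measurements of either model; plug in your own benchmark numbers.

```python
def seconds_to_answer(prompt_tokens: int, output_tokens: int,
                      prefill_tps: float, decode_tps: float) -> float:
    """Wall-clock estimate: prompt processing time plus generation time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical numbers: the smaller model decodes twice as fast
# but spends four times as many tokens reasoning its way to the answer.
small = seconds_to_answer(prompt_tokens=1000, output_tokens=8000,
                          prefill_tps=500, decode_tps=40)
large = seconds_to_answer(prompt_tokens=1000, output_tokens=2000,
                          prefill_tps=250, decode_tps=20)

print(f"smaller model: {small:.0f} s, larger model: {large:.0f} s")
```

With these placeholder values the larger model finishes first despite the lower tokens-per-second figure, which is why raw decode speed alone is a poor way to compare models.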