Hacker News
I ran the Q4 quant (used with llama.cpp) through my "minesweeper" vibe-coding benchmark: https://senko.net/vibecode-bench/2026/minesweeper-gamma-4-12...

The result is decent, but it had a few bizarre/trivial syntax errors I had to fix manually: a few times it added an extra closing bracket or paren, and it wanted to separate function definitions with a comma. Not sure what that was about, but otherwise the output ran just fine.

So, with those qualifiers, I think it's a decent local coding model. It roughly compares with GPT-4.1 (!!), released 14 months ago, on the output: https://senko.net/vibecode-bench/2025/minesweeper-gpt-4.1.ht... (actually I'd call it better, but those syntax errors...)

I ran the quantized version (4-bit GGUF) on my consumer-grade card with 12G of VRAM and got 5t/s for output. Not fast enough for interactive coding, but a fairly capable model.

To me, it's fascinating how much progress we've made in just over a year. GPT-4.1 was considered an extremely capable coding model. Now we have something with 12B params performing roughly the same (in this specific benchmark, disclaimers, etc).

List of the various models I tested: https://senko.net/vibecode-bench/

It was almost certainly not trained for coding, as it's got both audio and vision input, is only 12B, and nowhere in the announcement is coding mentioned. It will likely not have good performance on coding in general, compared to other small models like Qwen 3.6 35B A3B, Gemma 4 26B A4B, Nvidia Nemotron 3 Nano 30B-A3B, gpt-oss-20b.

For 16GB laptops, Qwen 3.5 9B is the undisputed champ.

Gemma 4 31B is the top dog at small model coding, but is dense so it needs ~48GB unified RAM for full context. If you want decent coding on a laptop you need a lot of RAM. But this shouldn't be surprising, dev machines have always needed lots of resources.

> For 16GB laptops, Qwen 3.5 9B is the undisputed champ.

you can run qwen 3.6 35BA3B on a 12-16GB vram gpu and it works pretty well.

https://www.youtube.com/watch?v=8F_5pdcD3HY&t=1s

even the 27B in some quants can fit.

https://www.reddit.com/r/LocalLLaMA/comments/1tkmgwj/qwen27b...

qwen IMO is far better for coding, esp agentic coding when combined with something like Pi, it comes probably close enough to Sonnet for a lot of use cases.

Gemma family is better for almost all other tasks you'd use a local llm for.

You can run it, however those low quantized models (iQ2, iQ4, Q2) will very likely underperform the 9B versions at Q6/Q8.
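
For a rough sense of that trade-off: weight size is approximately parameter count times bits per weight, divided by 8. A quick sketch, where the bit widths are nominal assumptions (real GGUF quants store extra scale data and keep some layers at higher precision, so actual files come out larger) and KV cache is ignored entirely:

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes), ignoring KV cache and overhead."""
    return params_billion * bits_per_weight / 8


# Nominal bit widths: assumptions for illustration, not measured file sizes.
for name, params, bits in [
    ("35B at ~2 bits", 35, 2),
    ("35B at ~4 bits", 35, 4),
    ("9B at ~8 bits", 9, 8),
]:
    print(f"{name}: ~{weight_gb(params, bits):.1f} GB")
```

So a 35B model only lands in the same memory budget as a 9B at 8 bits once it is squeezed to around 2 bits per weight, which is where quality tends to fall off.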
I want to try a hybrid setup of Gemma 4 E4B with lots of context for general, then Qwen 3.5 9B or larger for coding. Strix Halo set up this weekend, which may enable even larger Qwen models with tons of context.
The larger Gemma models are quite good at PHP. I would not be surprised if that was a training objective — it's one of the more consumer-focussed programming languages. They have very good knowledge of wordpress hooks.

  > For 16GB laptops, Qwen 3.5 9B is the undisputed champ.
You seem like the guy to ask. For a laptop with 12GB VRAM (RTX 5070) and 32 GB system RAM, what is a good multilingual (English, Hebrew, Greek) model for conversing with personal notes in Org mode format? I don't care how long updating the model or rag takes, and even inference can be reasonably slow, but the results of the query as they relate to my personal notes are important. I don't care about general knowledge, for those questions I can use e.g. ChatGPT.

Thanks

Join us over on Reddit at r/LocalLLaMA to get 10 different opinions on that
Qwen 3.5 35B A3

Qwen models are always good. The 35B A3 model is a MoE model which means it has higher performance in RAM constrained environments compared to the 27B dense model (which is better at coding).

I don't have the experience to rate its Hebrew or Greek performance but apparently it's not bad.

Any Gemma 4 model, they are great at translations, multilingual
For the biggest languages, Spanish, French, maybe.

For smaller ones like my native Latvian, the output could be confused for good translation from across the room, the words do look like Latvian words. But the quality is Google translate circa 20 years ago, tops.

It could probably do a decent enough translation to English, if all you need is to get the gist of text. But for smaller European language outputs, nothing comes close to Gemini.

You may like https://www.llmfit.org/

(not recommendation, I've not used it .. yet)

Have you found Gemma 4 31B better than Qwen 3.6 27B Q8? I just started using Qwen + Pi agent and it's great, but "which model works best" is still totally crowdsourced and I was going off of people's opinions on reddit. Would love to hear more opinions if people have them.
> Have you found Gemma 4 31B better than Qwen 3.6 27B Q8?

Which quant of Gemma? For coding Qwen seems to be pretty far ahead. Gemma generally seems to have a "vaster" set of knowledge, but armed with a search tool that doesn't really matter, and Qwen 3.6 has been really great for all sorts of tool calling. I mostly do programming and related things though, fwiw.

> I was going off of people's opinions on reddit

It's extremely astroturfed all over the place, especially the larger subreddits, and especially the one related to a specific animal in a specific location. It's sad, as early on it was a great resource, but now it's mostly paid posts and a race to the bottom, with lots of piling, and all the knowledgeable people I used to recognize are nowhere to be found.

Yes. I'm using Gemma-4 31B (gemma-4-31B-it-assistant.Q4_K_M.gguf) with llama.cpp to attribute quotations throughout chapters of my sci-fi novel. I started with Qwen3, but couldn't get it to work. Qwen3 TTS Voice Design, on the other hand, is incredible (Qwen3-TTS-12Hz-1.7B-VoiceDesign). I'm using both for an audiobook generator that produces a variety of voices.

Screens:

* https://i.ibb.co/TBBV5nJk/kl-01.png (voice design)

* https://i.ibb.co/nNvvKDyV/kl-02.png (quotation attributions)

Gemma 4 31B is enormously impressive. You get 1000 requests/day for free on Google's API and another 1000/day off OpenRouter. Only problem is you get 503 like crazy.
> nowhere in the announcement is coding mentioned

It's right there in the middle benchmark bar "LiveCode Bench" 72%.

Qwen 3.5 9B is great for coding, but somehow, based on a few hours of subjective tests, the Gemma 4 12B seems even better.
It does appear to have training for javascript and PHP, from what I can see, and pretty solid knowledge of wordpress and woocommerce. I would guess it has beginner-friendly knowledge of Python, too?

(Though it is gaslighting me about PHP anonymous functions.)

I would not use it to write code (the MoE 26B writes really good PHP), but it appears to have absolutely good enough knowledge to write implementation plans, and I think that could be useful in a sort of agentic coding tutorial environment.

I test these models with simple things. My favourite mini test is asking an AI to write a "last login" tracker facility for wordpress with a sortable admin column, which is trivial code — only a few lines -- but touches on a reasonably deep bit of the WP API. If you ask it to prompt you with clarifying questions, those questions are quite revealing.

It can write the code. Not tested it but I am sure it works. It's not as elegant.

It is not as good at understanding nuanced instructions as either the 26B or the sparse Qwen 3.6. There are concise things you can say in a prompt to Qwen 3.6 that have it draw logical conclusions that fully impress me.

I am more impressed by it than I expected. I reckon this would be quite useful in a tutorial tool.

(I say this as a sort of qualified cynic; I think much of the AI circus is a farce. But if these things are to ever be useful for teaching without making people dependent on some cloud "intelligence tap", this is progress)

Yeah, I agree 24B-36B sizes are better in general.

I don't have unified RAM tho and offloading to CPU is dog slow, which is why I'm interested in 7b-12b models.

I find ram crazy. My thinkpad has 32G of ram, it's a t470 that's nearly a decade old

Why do people with modern laptops have such little amounts of ram?

The ram that’s important for LLMs is gpu-accessible memory, meaning either systems with unified ram or VRAM, the latter of which is tied to the caliber of GPU one has.
Unified memory is soldered to the motherboard and needs to be ordered with the new laptop, for prices that are well above what the equivalent amount of SODIMM would cost.

Fine if work's paying, but for personal devices (that might have been purchased before local models got good), people have what they have.

My job still issues 16GB laptops as standard. You need a business reason to get more. This has been going on since before the price hikes.

I’m a system administrator and I can do my job with no issues at 16GB. Most days 8GB would likely be enough, since I’m just using and abusing other systems anyway.

Java devs at my last job were still running 16GB in 2020. Admittedly that was a while ago. Still not a decade.

Close some Chrome tabs?

8GB was the standard for a long time (before Apple went Silicon) because, from what I understood, SDRAM needs to constantly refresh its memory cells, otherwise the bits will fade, so having more RAM meant your battery would last a little less... this was around the time when 3 hours charge was unheard of, so every little bit helped.

Probably doesn't matter these days with all-day batteries, but now the demand-supply curve is lopsided.

> It roughly compares with GPT-4.1 (!!), released 14 months ago

I think the major win for coding was reasoning. That's why such a small model can match GPT-4.1 in coding, but I suspect that GPT-4.1 still wins in general world knowledge due to bigger size.

> I suspect ... still wins in general world knowledge due to bigger size

Encyclopedic knowledge matters relatively little in perspective, given the expectable future developments: even the more knowledgeable of us will use that knowledge for reasoning and intuition (and we will have absorbed the intellectual keys during our training), but under our professional hat we should in theory be ready to go "I stand corrected" and "more precisely" with the actual data at hand.

I.e.: for the encyclopedic knowledge needed, the /understander/ will have a RAG subsystem and a corpus of knowledge to inquire upon processing queries.

(Corroboration: we can't delirate, and neither can the machine...)

Don't LLMs work on attention though? The closer in their hyperdimensional space you can land your problem to their inherent understanding, the better they are at understanding your problem domain. RAG loops can be very slow and agents may simply lack the knowledge to use them correctly.
I agree with you in general, but depending on the task I also find that a certain level of encyclopedic knowledge can be very valuable. For example, if you use it for coding, the model will likely not resort to search or RAGs when deciding whether to use a particular package or stack.
A great position to take. Strong opinions, weakly held.
>consumer-grade card with 12G of VRAM and got 5t/s

That token output speed suggests to me that it's running in hybrid mode and involving CPU + system RAM somehow. That ~5tk/s is about what DDR4 RAM bandwidth gives you for a model of that size at 4-bit. Any consumer GPU with 12 GB like an Nvidia RTX 2080 or RTX 3060 should be doing 20+ tk/s with llama.cpp and the CUDA backend.
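
A back-of-envelope sketch of why partial offload hurts so much, assuming decoding is purely memory-bandwidth bound and each memory pool is read in turn. The model size and both bandwidth figures are illustrative assumptions, not measurements of the setup above:

```python
def tokens_per_second(parts):
    """parts: list of (gigabytes_read_per_token, bandwidth_gb_per_s), one per memory pool.

    Assumes the pools are read one after another, so the time per token
    is the sum of size / bandwidth over all pools.
    """
    seconds_per_token = sum(gb / bw for gb, bw in parts)
    return 1.0 / seconds_per_token


# All figures below are assumptions for illustration.
MODEL_GB = 7.0   # roughly a 12B model at 4-bit, plus some overhead
GPU_BW = 360.0   # GB/s, ballpark for a 12GB consumer card
DDR4_BW = 40.0   # GB/s, ballpark for dual-channel DDR4

all_gpu = tokens_per_second([(MODEL_GB, GPU_BW)])
mostly_cpu = tokens_per_second([(2.0, GPU_BW), (5.0, DDR4_BW)])
all_cpu = tokens_per_second([(MODEL_GB, DDR4_BW)])

print(f"all on GPU: ~{all_gpu:.0f} tok/s")
print(f"2 GB on GPU, 5 GB in system RAM: ~{mostly_cpu:.0f} tok/s")
print(f"all in system RAM: ~{all_cpu:.0f} tok/s")
```

Under these assumptions the slow pool dominates: even a few GB left in system RAM pulls the estimate down to single digits.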

Good catch. I haven't looked deeply into it. This is with Vulkan backend on Linux which I understand should be roughly comparable to CUDA? Gfx is rtx 3060(ti?).

I should play a bit more with llama.cpp options and see what happened there. Thanks!

Thank you for sharing this. Do you think the syntactical issues could be addressed with fine tuning or some other kind of parameter tweaking? That's frustrating hah.
With a harness you could feed the code to a linter and if there are errors feed that to a model automatically. It’s amazing that the models are good enough that I haven’t bothered doing this
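
A minimal sketch of that kind of loop, with loud caveats: `generate` is a hypothetical stand-in for whatever calls your model, and Python's own parser plays the role of the linter here (for JavaScript output you'd swap in the linter you actually use):

```python
import ast
from typing import Callable, Optional


def syntax_error(source: str) -> Optional[str]:
    """Return a short error message if `source` is not valid Python, else None."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def generate_until_valid(generate: Callable[[str], str], prompt: str, max_rounds: int = 3) -> str:
    """Call the model, check syntax, and feed any error back into the next prompt."""
    code = generate(prompt)
    for _ in range(max_rounds):
        error = syntax_error(code)
        if error is None:
            return code
        retry = f"{prompt}\n\nYour previous answer had a syntax error ({error}). Fix it:\n{code}"
        code = generate(retry)
    return code  # may still be invalid; the caller should re-check


# Hypothetical stand-in for a real model call: broken first, fixed on retry.
def fake_model(prompt: str) -> str:
    if "syntax error" in prompt:
        return "def f(x):\n    return x + 1\n"
    return "def f(x):\n    return (x + 1))\n"


result = generate_until_valid(fake_model, "Write f(x) that adds one.")
print(result)
```

This only catches syntax errors like the stray closing paren; logic bugs would still need tests or a human.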
Models this small and this capable bode really well for the usefulness of a PC like the RTX Spark that Nvidia/Microsoft announced this week. 128GB of unified memory will likely be more than sufficient for effective local agentic coding, even if SOTA cloud models will still be even better.

Up until this point, I've found the cost/value to unequivocally favor using a cloud subscription, but I would be lying if I didn't worry that one day OpenAI is going to increase the price for my subscription by 5-10x. I rely on these tools enough that if there is a real viable local option, I'm going to take it.

> usefulness of the RTX Spark

Not really. There's a reason the announcement didn't include ANY benchmark (!) and didn't mention EXACTLY what the memory bandwidth is. It's going to be dog-slow and unusable for large models, as tok/sec is basically bandwidth divided by active weights. Rumoured 300GB/s / 30GB active weights (decent model) = 10 tokens per second, which is really slow.
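
That rule of thumb as a tiny sketch. The 300 GB/s and 30 GB figures are the rumoured numbers from this comment; the 3 GB MoE case is a hypothetical added for contrast, and the estimate is an upper bound that ignores compute, KV cache reads, and overhead:

```python
def decode_tokens_per_second(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper-bound estimate: every active weight is read once per generated token."""
    return bandwidth_gb_s / active_weights_gb


# Rumoured bandwidth, dense model with 30 GB of active weights.
print(decode_tokens_per_second(300, 30))  # 10.0

# Hypothetical MoE model with only ~3 GB of weights active per token.
print(decode_tokens_per_second(300, 3))  # 100.0
```

This is also why sparse MoE models feel so much faster than dense ones on the same box: total size decides whether the model fits, active size decides the speed.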

Yep, I have a Strix Halo and while it can run models bigger than Qwen 3.6 27b, it's not usable interactively when you do. ds4 patched for ROCm works, but at such a slow speed, it's not usable for coding agents.

The Nvidia boxes have only slightly more memory bandwidth, so I wouldn't expect them to be notably faster. At least not enough to make it useful interactively at that scale.

Why does everyone expect interactivity from local AI? It's not the best use of the hardware, especially not miniPC hardware. Long-term batched inference with larger and more capable models is much more feasible AIUI.
The RTX/DGX Spark, Mac Ultras with 128GB unified ram are all ~$5k. It's still an expensive toy for rich people, it might as well be an H100 for 99.9% of the population (not devs with high paying jobs, of course).

the value of local models is allowing normal people to access AI without needing to subscribe to cloud services. this is esp imp for the rest of the world where even a 12GB gpu is extremely expensive.

there is no real viable local option that will come even close to Sonnet/Gemini Flash or the cheaper chinese models. Even if your pc costs <$2k you are never going to recoup the hw costs, and the results will be far worse.

I'm using a Strix Halo laptop (~3k, 64GiB) and with Gemma 4 and Qwen 3.6, both at 8 bits, I'm seeing very impressive results.

As a work tool, this is reasonably priced. You can save a bit of money by opting for a non-laptop form factor.

My Framework Desktop with 128GB was about half that. I did luck out by buying before RAM prices went crazy, though.

I'm looking forward to the fallout when the data center bubble bursts. There's a good possibility we'll see a glut of hardware, either on the used market or from manufacturers that no longer have massive orders from OpenAI and the like.

RTX Spark is pretty much the DGX Spark in a laptop form factor, plus some lower-performing chips in the same series to be released later according to rumors. We know quite well how the top-of-the-line chip performs: it's very interesting for some application areas, less so for others.
> my consumer-grade card with 12G of VRAM and got 5t/s for output

Thank you for giving me hope!

We are really getting close to singularity - the pace of LLM improvement is constantly accelerating.
>The result is decent, but it had a few bizarre/trivial syntax errors I had to fix manually

Can you instruct it to use an LSP?