
Gemma 4 12B: A unified, encoder-free multimodal model

https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/
I ran the Q4 quant (used with llama.cpp) through my "minesweeper" vibe-coding benchmark: https://senko.net/vibecode-bench/2026/minesweeper-gamma-4-12...

The result is decent, but it had a few bizarre/trivial syntax errors I had to fix manually: it would add an extra closing bracket or paren a few times, and it wanted to separate function definitions with a comma. Not sure what that was about, but otherwise the output ran just fine.

So, with those qualifiers, I think it's a decent local coding model. It roughly compares with GPT-4.1 (!!), released 14 months ago, on the output: https://senko.net/vibecode-bench/2025/minesweeper-gpt-4.1.ht... (actually I'd call it better, but those syntax errors...)

I ran the quantized version (4-bit GGUF) on my consumer-grade card with 12G of VRAM and got 5t/s for output. Not fast enough for interactive coding, but a fairly capable model.

To me, it's fascinating how much progress we got in over a year. GPT-4.1 was considered an extremely capable coding model. Now we got something with 12B of params performing roughly the same (in this specific benchmark, disclaimers, etc).

Lists of various models I tested: https://senko.net/vibecode-bench/

It was almost certainly not trained for coding, as it's got both audio and vision input, is only 12B, and nowhere in the announcement is coding mentioned. It will likely not have good performance on coding in general, compared to other small models like Qwen 3.6 35B A3B, Gemma 4 26B A4B, Nvidia Nemotron 3 Nano 30B-A3B, gpt-oss-20b.

For 16GB laptops, Qwen 3.5 9B is the undisputed champ.

Gemma 4 31B is the top dog at small model coding, but is dense so it needs ~48GB unified RAM for full context. If you want decent coding on a laptop you need a lot of RAM. But this shouldn't be surprising, dev machines have always needed lots of resources.

> It roughly compares with GPT-4.1 (!!), released 14 months ago

I think the major win for coding was reasoning. That's why such a small model can match GPT-4.1 in coding, but I suspect that GPT-4.1 still wins in general world knowledge due to its bigger size.

>consumer-grade card with 12G of VRAM and got 5t/s

That speed for token output indicates to me that it somehow is using hybrid mode and involving cpu+system ram somehow. That ~5tk/s is about the ram bandwidth of DDR4 RAM versus that size model at 4bit. Any consumer GPU with 12 GB like a nvidia rtx 2080 or rtx 3060 should be doing 20+ tk/s with llama.cpp and CUDA backend.
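
The reasoning above can be sketched as back-of-the-envelope arithmetic. This assumes decoding is purely memory-bandwidth-bound (every weight read once per token) and uses theoretical peak bandwidths for dual-channel DDR4-3200 and an RTX 3060 12GB; real throughput lands below these ceilings, and the helper name is made up for illustration.

```python
def max_tokens_per_second(params_billion: float, bits_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every weight is read once per token."""
    model_gb = params_billion * bits_per_param / 8  # 12B at 4 bits -> 6 GB
    return bandwidth_gb_s / model_gb


# Theoretical peak figures, not measurements.
DDR4_3200_DUAL_CHANNEL_GB_S = 51.2
RTX_3060_GB_S = 360.0

cpu_bound = max_tokens_per_second(12, 4, DDR4_3200_DUAL_CHANNEL_GB_S)
gpu_bound = max_tokens_per_second(12, 4, RTX_3060_GB_S)

print(f"system RAM ceiling: {cpu_bound:.1f} tok/s")  # 8.5 tok/s
print(f"VRAM ceiling:       {gpu_bound:.1f} tok/s")  # 60.0 tok/s
```

A measured ~5 tok/s sits under the system-RAM ceiling and far under the VRAM one, which is consistent with part of the model being served from system memory.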

Thank you for sharing this. Do you think the syntactical issues could be addressed with fine tuning or some other kind of parameter tweaking? That's frustrating hah.
Models this small and this capable bode really well for the usefulness of a PC like the RTX Spark that Nvidia/Microsoft announced this week. 128GB of unified memory will likely be more than sufficient for effective local agentic coding, even if SOTA cloud models will still be even better.

Up until this point, I've found the cost/value to unequivocally favor using a cloud subscription, but I would be lying if I didn't worry that one day OpenAI is going to increase the price for my subscription by 5-10x. I rely on these tools enough that if there is a real viable local option, I'm going to take it.

> my consumer-grade card with 12G of VRAM and got 5t/s for output

Thank you for giving me hope!

The big story here is the encoder-free part, which I still don't fully understand.

> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.

That's technically encoding, just without using a dedicated model for it like SigLIP? The Developer's Guide elaborates: it's still a 35M-parameter layer, and I am curious whether that is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...

> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.

I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.

Embedded within that developer page is a good explainer of the encoder-free architecture: https://newsletter.maartengrootendorst.com/p/a-visual-guide-...
This is just early fusion basically.

FAIR did this 2 years ago now: https://arxiv.org/abs/2405.09818

I've been waiting for something like this to be released since then.

The annoying thing is that Chameleon had multi-modal output based on the same principles, but this model only takes multi-modal inputs... (I'm curious how they did pre-training without having multi-modal outputs as well. I wonder if they just chopped them off rather than support image output).

I don't think it's the same. It's a similar concept, but Gemma is using just a linear projection, which I assume is a lot faster. The developer guide has more details: https://developers.googleblog.com/gemma-4-12b-the-developer-...

    Vision embedder (35M parameters): Replaces the 27 vision transformer layers of the other medium-sized Gemma 4 models. Raw 48x48 pixel patches are projected to the LLM hidden dimension with a single matmul. A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input
the "single matmul" is the key here, I haven't tried it, but it's probably pretty fast and memory efficient.
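
For anyone who wants to see the shape of it, here is a minimal pure-Python sketch of the quoted description: flatten each 48x48 patch, project it with one matrix, then attach a column vector and a row vector for position. Assumptions: the position vectors are added (the guide only says "attaches"), there are 3 colour channels, normalization is left out, and the hidden size of 8 is a toy value so it runs quickly. None of the names come from the actual model code.

```python
import random


def embed_patches(image, patch, w, x_emb, y_emb):
    """image: H x W x C nested lists; w: (patch*patch*C) x D. Returns one D-vector per patch."""
    height, width = len(image), len(image[0])
    d_model = len(w[0])
    tokens = []
    for py in range(height // patch):
        for px in range(width // patch):
            # Flatten the raw pixels of one patch into a single vector.
            flat = [
                channel
                for row in image[py * patch:(py + 1) * patch]
                for pixel in row[px * patch:(px + 1) * patch]
                for channel in pixel
            ]
            # The "single matmul": flat (1 x P) times w (P x D).
            vec = [sum(flat[i] * w[i][j] for i in range(len(flat)))
                   for j in range(d_model)]
            # Factorized position: one lookup by column, one by row (assumed additive).
            tokens.append([vec[j] + x_emb[px][j] + y_emb[py][j]
                           for j in range(d_model)])
    return tokens


random.seed(0)
PATCH, CHANNELS, D_MODEL = 48, 3, 8  # 48 is from the guide; the rest are toy values
H, W = 96, 144                       # gives a 2 x 3 grid of patches
image = [[[random.random() for _ in range(CHANNELS)] for _ in range(W)]
         for _ in range(H)]
w = [[random.gauss(0, 0.02) for _ in range(D_MODEL)]
     for _ in range(PATCH * PATCH * CHANNELS)]
x_emb = [[random.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(W // PATCH)]
y_emb = [[random.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(H // PATCH)]

tokens = embed_patches(image, PATCH, w, x_emb, y_emb)
print(len(tokens), len(tokens[0]))  # 6 8
```

The factorized lookup needs only (columns + rows) position vectors instead of one per grid cell, which is presumably part of why the module stays small.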
Some of the FAIR people moved to Thinky, and they also started doing encoder-free MM-LLMs. Now Google. This seems to be becoming a trend that works at small scale, but the difficult part is scaling.

The standard approach for training MM-LLMs is to train the encoder first. There are O(2-10B) good images on the internet and the encoder needs to see each image O(10-100) times, so that is O(100T) tokens, which is more than the entire pre-training budget for most runs. That is the reason we train the encoder separately (smaller model, 2B active vs 30B or 200B active LLM); there is nothing magical about training the encoder and LLM together, it is just more token-efficient to train the image modality first.

I would contend that the actual big story is the gallery app:

https://developers.google.com/edge/gallery

Anyone with a 16GB Mac — that is quite a lot of journalists, surely — can download that, install a model into it, and play.

Surely journalists have to start asking questions at least about OpenAI's consumer revenue projections now.

I am a major, major AI cynic, but I decided to be an informed cynic so I've been playing with local models for agentic work and a bit of CAD-to-image generation. I really quite like the 26B Gemma model — I've been using it to teach myself some fundamental things and learn OpenCode without developing a cloud dependency. It writes fairly good code and it is helping me learn the things I want to learn at a pace that I prefer.

But if this 12B model is even half as close as they say it is, this casts some doubt on the consumer end of the cloud business model, at least in the short term.

(Not clear if this app is using the MTP drafters; I've still not got them working with Gemma myself, though the Qwen 3.6 built-in MTP support is super in LM Studio)

Totally agree that it is "encoding" in the general sense, but I think they are referring to the lack of an "encoder" neural network.
> quantization

12b means 12G @ 8 bits/param (basically lossless) and 6G at 4 b/p (generally accepted 'pretty close' level). Not too bad?

But TBD how well the base model performs before thinking too much about quantization
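
The arithmetic above, as a small sketch. It counts weights only and treats 1 GB as 1e9 bytes; the 16 GB budget is the figure from the announcement quote, and KV cache plus runtime overhead come on top. Real GGUF quants also tend to use somewhat more than their nominal bit width, so treat these as lower bounds.

```python
def weight_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


BUDGET_GB = 16  # memory figure quoted from the announcement

for bits in (16, 8, 4):
    size = weight_size_gb(12, bits)
    headroom = BUDGET_GB - size
    print(f"{bits:>2} bits/param: {size:4.0f} GB weights, {headroom:+.0f} GB headroom")
```

At 8 bits the weights leave about 4 GB for context and everything else, and at 4 bits about 10 GB; unquantized 16-bit weights do not fit at all.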

The audio side is even more interesting, as it seems they totally got rid of positional embedding and are just doing a single linear transform to match the LLM input dimension, and that's it.

> Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.

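A rough sketch of what "project the raw audio signal" could look like, going only by the quote above: cut the waveform into frames and multiply each frame by one matrix, with no positional term. The 16 kHz sample rate, the non-overlapping framing and the toy hidden size are assumptions; the 25 tokens per second figure is the one mentioned elsewhere in this thread.

```python
import math
import random


def project_audio(samples, frame_len, w):
    """Split raw samples into non-overlapping frames and project each with one matrix."""
    d_model = len(w[0])
    n_frames = len(samples) // frame_len  # a trailing partial frame is dropped
    tokens = []
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        tokens.append([sum(frame[i] * w[i][j] for i in range(frame_len))
                       for j in range(d_model)])
    return tokens


SAMPLE_RATE = 16_000                   # assumption, not from the announcement
TOKEN_RATE = 25                        # tokens per second, as cited in the thread
FRAME_LEN = SAMPLE_RATE // TOKEN_RATE  # 640 raw samples per token
D_MODEL = 8                            # toy hidden size

random.seed(0)
w = [[random.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(FRAME_LEN)]
# Two seconds of a 440 Hz sine wave as stand-in audio.
samples = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
           for t in range(2 * SAMPLE_RATE)]

tokens = project_audio(samples, FRAME_LEN, w)
print(len(tokens), len(tokens[0]))  # 50 8
```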
One side effect is that the separate .mmproj file (Multi-Modal Projection encoder) is no longer needed when using the model with llama.cpp etc.
loading story #48388663
loading story #48386966
I don't think we've bottomed out on what we can do with embedding models. They're these tiny models that absolutely rip on modern cpus with 8 bit int optimizations. Like in my app we can say pretty definitive things about hundreds of millions of places in the world on retrieval tasks on regular hardware.
I think the idea is that the model is seeing embeddings that map directly to underlying pixel data, rather than being fed semantically rich embeddings from an encoder model which itself had seen the raw pixel data.
Encoder-free is huge for running on SBCs etc. Often the encoding time is a significant fraction of generation time if you are using a VLM as an all-purpose vision model.
Either Google changed the text or you editorialised it a tiny bit - just for all others that got excited, they mean 16GB VRAM. So a premium graphics card requiring a >2500€ device is the minimum to run this.

Still progress, but not quite democratic yet.

Weird though that Google might be cannibalising its own AI subscription service?

It actually works well because unlike encoders, the latent space is trained on that initial layer so it “knows” what to do with that sparse density. I’ve been using gemma4-12b with Flux2 and its ability to reason on visual input is pretty good. That said, each model is good in their own ways so YMMV but overall, it’s about as solid as Qwen just with a more advanced architecture.
I don't see how encoder-free audio isn't a mistake here. A MiMo model will at least get the audio to 12.5 Hz as opposed to the 25 Hz they are doing, and you don't need to fine-tune MiMo either.
There is plenty of prior work on encoder-free VLMs. I specifically remember the EVE series of models from ~2 years ago.

https://github.com/baaivision/EVE

> That's technically encoding

Isn't that just projecting the patches into the d_model-sized vectors that the model takes?

>I am assuming that involves quantization

12B model in 16GB seems very reasonable to me, int8 is top quality for running models.

VRAM, not RAM. I wish it was light enough for iGPUs too
Well, it's a real simple encoder, I guess.
We are now entering the closed loop game. Google doesn't need anyone else to accelerate their models. This is their bread and butter.

I'm both shocked but also not surprised that they continue to develop such efficiencies. Honestly it's like silicon and CPU architecture advancement. We kept shrinking it and shrinking it and it kept getting more and more powerful and here we are with AI and it's only going to be 100x more efficient with time. Maybe there's some point of decay but essentially the next 30 years will be more advanced than the last 30 and we're going to be living in some sort of futurist blade runner scenario where gene editing is repairing ageing cells, organs and curing all sorts of cancers that haven't even appeared yet. Beyond our lifetimes people will live to 125 quite steadily and with great mobility and then obviously people will look to how do we get to living 1000 years, which of anyone is religious knows Noah and others lived to that age in a totally different era.

Anyway I'm going off on some tangent but look back 30 years. Now look forward 30 years. It's going to be insane. May God protect us.

> We kept shrinking it and shrinking it and it kept getting more and more powerful and here we are with AI and it's only going to be 100x more efficient with time.

It's definitely an exciting time, but in terms of advancements in the state of the art, there is a lot of low-hanging fruit left to pick. There IS a bottom, however, as you can only encode so much "knowledge" in a small number of parameters.

This feels to me a lot like what the early days of radio or aviation must have been like. Or, heck, microcomputers even.

1996 didn't look that different than today, in the US anyway. Biggest difference, besides the electric cars, is everybody has a phone but nobody uses it to talk to people.
> May God protect us.

Today, data systems and algorithms can be deployed at unprecedented scale and speed. Unintended consequences will affect people with that same scale and speed

Michael Chapman

Yes I've taken the "must optimise longevity" route, taking priority over other things such as my career and hobbies. I want to see the future - all this AI stuff fascinates me.
> which of anyone is religious knows Noah and others lived to that age in a totally different era

My favourite conspiracy theory lately is that the above isn't a silly fairy tale, that we actually used to live much much longer -- until the common cold came on the scene, and the sequelae dramatically shortened our lifespans. Today we dismiss it as "just a cold", unaware of what it robbed us of.

Nope, lol.

Large models still are quite far ahead, don't be fooled that even Gemma:31b (which is better than the 12b overall) is anywhere close to big models.

There is definitely room for optimization, but fundamentally, for complex tasks, you need visible small gradients for accuracy that allow the model to be trained on (and consequently be followed during inference). For example, if you specify in the instructions not to write code but ask a coding question, Gemma will still write code. Whereas Gemini/Claude will pick up on that and follow your instructions better.

What's Google's business case for releasing open models? Don't get me wrong, I am grateful and appreciative of these releases. I'm trying to understand how it fits into their bigger picture as a for profit company? Are they not helping competitors build on the novel technology they have developed?

Is it simply goodwill and/or marketing? Or am I missing something strategic?

A big part of the frontier labs' ability to charge 80% gross margins on inference is having the cornered resource of frontier models.

If that inference becomes popular and valuable enough that those companies make billions of dollars in profit, those companies could use that profit to fund the building of alternative products and platforms that disintermediate Google's relationship with the customer.

Google already has an 80% gross margin business, the biggest one in the world. Everybody wants a slice of it.

By offering frontier inference closer to cost and open-sourcing everything that's sub-frontier, they're commoditizing frontier labs' models, which inhibits their ability to durably make high gross margins on inference.

It's a strategic play.

This won't replace commercially viable, revenue generating alternatives of their own devising, but it does enable development activity and initiate conversations with enterprises who start with this model but want to do slightly more.

That's my experience right now... my company is all in on a plethora of platform products. Also, Microsoft just yesterday said their goal was "Unmetered intelligence". There's a lot of things that can be enabled by small local models, and those things are part of stacks that can generate revenue in other layers.

Android and Chrome need on-device AI capabilities. Google can't lock down those weights like it can with server-side ML.

So it's easier to just release those models as open source and make it official, since someone would inevitably hack the weights out anyway.

If you're an AI lab, you definitely want research teams in this space - as this is where you can most easily iterate and make improvements which you'll then bake into larger, frontier models.

The question is: do you want to release your models, or use them purely for R&D?

Since everyone else is already releasing models of similar qualities, it's hard to say you're shooting yourself in the foot if you join the chorus.

The added cannibalization of releasing them is effectively zero, so the reputational benefits are likely to be worth it.

Google is one of the few verticalized options in AI: Data, models, cloud services, low-level silicon (TPUs), internal use cases, retail use cases, B2B uses, distribution (browser & mobile), etc.

They rise with the tide of AI adoption. But they gain ground if people opt into Google solutions. And any token sent to a Google model (free or paid) actively punishes their competitors that are then required to spend vast sums to remain bleeding edge.

Neutering OpenAI and Anthropic would be my guess. Commoditized LLMs won't hurt Google nearly as much as it hurts the LLM-only companies, and so accelerating the inevitable just helps knock out potential future competition in areas where Google -does- make a lot of money now.
As long as Chinese firms are releasing good open models I imagine there isn't a huge downside for Google to release state of the art small models to compete in the "free" space.
Demis at Y Combinator said that they think it's best their edge models are open, because once they are put on device they are vulnerable anyway.

https://youtu.be/JNyuX1zoOgU?is=PdzCILyi8SP6cfDr

Demis is on record saying they need models on the edge and if they’ll be there they might as well be properly open as they’ll be dumped anyway.
It's to destroy possible footholds for competitors and prevent them from making money in segments that Google doesn't care too much about, but can trivially commoditize.
I think it's even more puzzling because you can't even run Gemma 31b on Google Cloud; they only let you test it with a rate limit. No way (I can find) to actually pay them to use it.

We saw great results in our use case using Google directly. Moved to OpenRouter because Google wouldn't let us use it beyond a test.

Then OpenRouter's performance looked worse; not sure if there was a quantized version or something. So we instead looked at Deepseek v4 Flash, and opted to go for that.

This model would probably be great as a super low cost cloud model. I would love to use it in the cloud, but Google makes you go elsewhere.

A strong business case for Gemma includes fine tuning, adding AI to apps that run in the cloud, strengthening Android, shifting unprofitable small AI compute to devices, and harming competitors. The first two would be done using Google's cloud services due to integration with Gemma. I think Google is currently the best positioned company to profit from AI sales to businesses over the next few years, and Gemma is a critical part of the story.
Gemini is a huge team while Gemma is relatively small. They can totally do this at a loss with no ulterior motive.

They remind me a bit of HuggingFace, create something great then make money … maybe.

Isn't Apple about to license some variation of this from google for on-device AI? Maybe it’s their sales pitch to Apple and then they will lock it down.
Google's MO since always has been to release great products or services for free, position themselves high and then abandon them or just find uses for Enterprise sales.

I'm pretty sure they are doing it because they get some research experience by shrinking and improving these models, and because they know that by doing this they get some good PR among the dev community.

Maybe they are hedging against a future where local models are just as good as cloud models? Or maybe they can go the Taalas route and start hardcoding Gemma on a chip and hardware manufacturers can use it for local private AI.
They're trying to capture the segment of the market that wants to control the model, with the intent of getting you to run them on Vertex.
My guess is testing for Apple’s Siri replacement and partnership but that’s a total SWAG
Marketing + Pro Serv if I had to take a guess.
The complete Chinese worldwide domination in this sector would be the alternative, since nobody else is releasing anything meaningful.

Plus every open model undermines their local competition by furthering open research and reduces moats, especially since Gemini as a frontier model isn't really competitive with GPT nor Claude for most applications.

Competition from Chinese alternatives hopefully forces more openness and efficient models. DeepSeek for example is nearly on par and far more resource efficient, good for the planet imo
On-device, e.g. Android.
Evangelism for AI. Google is one of the big AI providers.

Eventually the local model is not enough, and you'll upgrade to the big ones.

Gemma overtakes and kills real open-source AI projects, pushing people who would support them towards enterprises like Google
Its image processing is terrible. I ran several tests of it against Qwen 3.5 0.8b (yes, 7% the size) and Qwen beat it every time, with Gemma often getting things entirely wrong. I even gave it a plain image saying "This is a test" and it thought for 6 minutes trying to analyze it and failed. Qwen 3.5 0.8b confidently got it in under a second.

It may be that the Q6 quant I got is borked (or my LM Studio is), but either way, the 0.8b's performance is mind boggling in comparison.

For Qwen 3.5 0.8B presumably you're running it unquantized, because it's so small. Get at least the Q8 of Gemma 4 12B with the F32 mmproj and use an f16 kv cache.

Then run it with the latest llama.cpp that contains the Gemma 4 12B unified bug fixes, using --image-min-tokens 560 --image-max-tokens 2240 --batch-size 4096 --ubatch-size 4096 --temp 1.0 --top-p 0.95 --top-k 64 --jinja
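
If it helps to keep those flags in one place, here is a small sketch that assembles them into a copy-pasteable command. The `llama-server` binary and the model filename are my assumptions (the comment above names neither); the flag values are exactly the ones listed.

```python
import shlex

# Hypothetical filename; substitute whatever GGUF you actually downloaded.
MODEL_PATH = "gemma-4-12b-Q8_0.gguf"

args = [
    "llama-server",            # assumed binary; the comment only says "llama.cpp"
    "-m", MODEL_PATH,
    "--image-min-tokens", "560",
    "--image-max-tokens", "2240",
    "--batch-size", "4096",
    "--ubatch-size", "4096",
    "--temp", "1.0",
    "--top-p", "0.95",
    "--top-k", "64",
    "--jinja",
]

command = shlex.join(args)
print(command)
```

Keeping the arguments as a list also means they can be handed to `subprocess.run(args)` without worrying about shell quoting.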

It's understanding far more complex things for me and can reliably handle tiny text, so it should be easily understanding an image that only contains the text "This is a test".

That sounds like a bug. They're very common for open model releases on the first day. If I wasn't on mobile I'd try it on Google's own app.
Sounds like you're doing it wrong, to be honest.
I guess Google implements more / stronger guard rails than Alibaba and thus confuses these small models. At least this was my impression with Gemma3 models where it often said that the image contains some nudity / sex scenes and therefore it cannot give a description of the image. Never understood the point of this behavior....
I've always found the Gemma models to vastly under-perform on vision tasks compared to Qwen so that's nothing new.
Quite aside from the architectural changes, I suppose this is the answer to why Google had such a glaring hole in the (pretrained) Gemma4 model lineup between the Gemma4 4b and Gemma4 26b models!

A model that comfortably fits in 16GB of VRAM (allowing room for context) is a welcome upgrade.

What are the use cases for these small models? Is there anyone using models of this scale in their daily life who could share their experience?
I have vLLM running on a Linux machine in my basement, connected with Tailscale, and I use small models as part of tasks like this:

- Transcribing scanned documents into formatted text

- Captioning/describing images and classifying them for audience suitability (includes anti-spam)

- Matching documents with relevant Wikipedia pages for tagging

I don't use them like frontier models. I break the work down into micro-tasks with one clear goal for each prompt. I write a lot of glue software to make the complete flow work. I was working on all of these tasks before LLMs appeared on the scene. The LLMs have allowed me to replace a lot of complicated code with less code plus a model, while achieving better results.
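
A minimal sketch of the "one clear goal per prompt, glue code around it" pattern described above. Everything here is illustrative: `complete` is a hypothetical hook you would wire to your own inference server, the label set is invented, and `fake_model` only exists so the example runs without a model.

```python
from typing import Callable

# Hypothetical hook: takes a prompt, returns the model's text.
Complete = Callable[[str], str]

ALLOWED_LABELS = {"general", "mature", "spam"}


def summarize(complete: Complete, text: str) -> str:
    """Micro-task 1: a single goal, a single prompt."""
    return complete(f"Summarize in one sentence:\n{text}").strip()


def classify(complete: Complete, summary: str) -> str:
    """Micro-task 2: constrained label; glue code rejects anything off-list."""
    answer = complete(
        f"Answer with exactly one word from {sorted(ALLOWED_LABELS)}:\n{summary}"
    ).strip().lower()
    return answer if answer in ALLOWED_LABELS else "needs_review"


def pipeline(complete: Complete, text: str) -> dict:
    """Chain the micro-tasks; each step sees only what it needs."""
    summary = summarize(complete, text)
    return {"summary": summary, "label": classify(complete, summary)}


def fake_model(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    if prompt.startswith("Answer"):
        return "general"
    return "A scanned letter about a garden."


result = pipeline(fake_model, "Dear Anna, the tomatoes are finally ripe...")
print(result)  # {'summary': 'A scanned letter about a garden.', 'label': 'general'}
```

The point of the `needs_review` fallback is that a small model's off-script answer gets caught by ordinary code instead of flowing downstream.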

I use local models for reasons of cost and control. I already had the workstation and GPU. The only running cost is electricity. I have used proprietary models from OpenAI and Google for some of these tasks, but I also encountered churn when the models I built my tools around were retired. I don't worry about that when I have the weights saved locally.

I've got a home-built dictation app that uses a local model to clear up the text and fix grammar. It was super easy to build. I’m extending it to capture meeting notes and summarise too. All on-device.

I saw a little app the other day, I think someone posted on here, that looks at your screenshot and renames the file based off the contents of the file.

There's tons of little examples like that. For a lot of use cases, you really don't need the frontier models.

I think small models have a very good niche for specific tasks. I utilise a fine tuned Phi-4 model (smaller than this one) that fits in about 3.5gb of RAM (not vram) for the document processing side of things for the desktop app I develop (a bit of a shameless plug - whistle-enterprise.com).

If you have a very specific idea for local model use you can find a way to make it work very well, you don't even need to have a graphics card or NPU chip. You just have to be extremely constrained in how it's used. I think as a generic chatbot they're not great, I'd use a hosted SOTA model and I'm a big fan of local LLMs myself.

I use small models like Gemma to improve transcriptions from ASR models amongst other micro-tasks. I actually built out a Whisper fine-tuning pipeline with all local (smaller) models, meaning no cloud/big-tech co is able to train on or sell my (private) data.

Repo is https://github.com/Rebreda/listenr - mainly geared toward Whisper fine-tuning, AMD hardware and local inference

I don't know about this model, but I've been using the next one up, the 31B, as an agentic coding assistant in OpenCode. Basically anything that's easy enough that I'd trust Sonnet to handle, I trust Gemma 4 to handle, and it's been doing a great job; it surprises me positively much more often than negatively. I not infrequently run into situations where Gemma 4 fails to do the task and I switch to Opus 4.7 and it fails also.
In theory, locally you'd use these where lossiness is acceptable for audio transcription and image labeling (as simple examples).

In practice I haven't got around to building something around multimodality since I'm primarily using their text generation capabilities.

"Small" models are the ones I can run myself on my own terms. LLMs aren't useful enough for me to justify spending hundreds of euros on a GPU with 16GB VRAM or something, and that's assuming I have the rest of the desktop just laying around. Back when I checked (before the RAM price hike), these models weren't meaningfully better than 4-8GB ones anyway, you'd have to go for the top tier cards at 24 or 32 GB iirc to get something vaguely in the direction of the SaaS versions, and that was absolutely out of my budget. Even if that changed, so have hardware prices so it'd probably still work out the same
I use them for research on new features. If my feature is going to interact with a frontier language model in prod, I start with these free local ones which are all competent enough to produce structured output, make tool calls, interact with mcp etc. I don’t care much for the content at the early phase of engineering, I care about the schema & failure modes.

Then when I’m getting close to feature-complete, I’ll move to a hosted frontier model for the final integration.

Cost savings are enormous if you’re making dozens of calls to language models a minute.
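
To make "I care about the schema & failure modes" concrete, here is a small sketch of the kind of check that works the same against a local model and a hosted one. The two-field tool-call schema and the failure labels are invented for illustration, not taken from any particular framework.

```python
import json

# Hypothetical minimal schema for a tool call: field name -> expected type.
REQUIRED = {"tool": str, "arguments": dict}


def check_tool_call(raw: str) -> str:
    """Return 'ok' or the name of the failure mode for one model reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(data, dict):
        return "not_an_object"
    for key, expected in REQUIRED.items():
        if key not in data:
            return f"missing_{key}"
        if not isinstance(data[key], expected):
            return f"wrong_type_{key}"
    return "ok"


replies = [
    '{"tool": "search", "arguments": {"q": "gemma"}}',  # ok
    '{"tool": "search"}',                               # missing_arguments
    '{"tool": "search", "arguments": "gemma"}',         # wrong_type_arguments
    'Sure! Here is the call: {"tool": "search"}',       # invalid_json
    '[1, 2, 3]',                                        # not_an_object
]
results = [check_tool_call(raw) for raw in replies]
for outcome in results:
    print(outcome)
```

Counting these outcomes over a few hundred cheap local calls gives a picture of which failure paths the surrounding code needs to handle before any paid tokens are spent.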

I've used Gemma for reviewing and categorizing my writing online over several years (~5 million words across a forum for an OSS project I work on, HN, reddit, etc.), experimenting with training LoRAs (again, on my own writing, since I don't have to worry about ethically sourcing the data if it's all mine), and I'm currently using it to perform web searches and extract data about a specific type of business. It's plenty smart to use a web search MCP to find all the businesses of the right type in a given city, read their website, extract business address, phone number, etc. among other things, and de-dupe and cross-check other sources.

I found Gemma 4 to be better, or at least more nuanced, than Gemini 2.5 Flash. And, the new Gemini 3.5 Flash is very good but is unrealistically expensive (ten times more expensive than DeepSeek or MiMo). So, since I don't need extremely fast performance, a self-hosted Gemma 4 wins for a bunch of stuff.

I've also found Qwen 3.6 27B to be shockingly good at finding security bugs for its size. It beats several larger models, and is close to Gemini Pro 3.1 (but Gemini 3.5 Flash surprisingly beats it soundly). Since it only costs electricity, and my electricity is cheap and 100% renewable, I can use it more broadly than I might otherwise use a hosted model.

All that said, the smart money is still on buying the subsidized tokens from the providers that offer them, rather than buying the hardware needed to run models that are 30+GB in size, as all of the ones I've been using regularly are (8-bit quantization, as they get a little dumber for every bit you drop below that). A $100 subscription to Claude or Codex currently provides access to the best models at a heavily discounted rate. And, DeepSeek/MiMo are extremely cheap, one or more orders of magnitude cheaper than the top models from Anthropic or OpenAI, if you need an API for automated usage. I spent about $4000 on my two inference machines (a Strix Halo with 128GB unified RAM, and a new desktop build based around two cheap old 32GB AMD data center GPUs), which buys a lot of tokens for tiny models like this...probably a couple/few years worth. But, I like tinkering, so having an excuse to play with hardware is its own reward. If it happens to pay me back some of that money, that's a bonus.

Of course, as the major providers decide they need to ring the cash register and stop burning money on subsidized tokens, that math may change, and I may find I'm grateful to have already bought this stuff before the RAM prices made everything 2-3x more expensive.

But, I think if you're not interested in learning about the technology and doing your own training experiments and such, you should probably not try to run stuff locally most of the time.

Yes, all my emails get sorted by a fine-tuned Gemma. They are turned into images and passed to the model, as multimodal is so practical.
I've yet to see someone answer a question like this with a decent, useful answer.
I moreso run other small special purpose models like Whisper, SAM, Matcha, CLIP etc. and then do contextual correction passes with models like this.

Think almost like unix pipelines, have used it for many workflows.

This is one https://post.bot/
I really like the idea of small models that you can get the most out of. If I weren't a programmer, I wouldn't even know what I would use Opus 4.8 or GPT 5.5 models for.
Wow Google is becoming the new pre Llama 4 Meta when it comes to releasing open weights models.
I dunno, feels a bit unfair to companies that actually do FOSS releases (Gemma 4 being released under the Apache 2.0 license) to compare them to a company that has never done any FOSS releases, and has mostly done proprietary "available to download" releases.
IDK this model release is a bit disappointing considering the community has been chomping at the bit for the 124ba4b model. There was some leaked info about it but people suspect it was not released because it was too close to gemini flash in performance.
Every other Google model I have tried felt very weak compared to Qwen models. I don't have a ton of use cases for multimodal though, so it's very possible this is a fantastic multimodal model.
This is a pretty good update. The demo video is a bit funny though: the tester asks it to turn the release into bullet points. Okay, the model obliges. Then the tester says to draft an email with this content. BAM! The LLM turns the content from bullets back into paragraphs even though it was not asked to, undoing the last good thing it did. I am not sure if it's email etiquette to avoid bullets in an email.
Strange that they are feeding raw audio in. Even in humans, there is a hardware transform to the frequency domain (the cochlea) before data is fed to the brain. Effectively doing this part inside the LLM seems inefficient.
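For reference, the kind of frequency-domain transform meant here is what a conventional audio front end computes before the model sees anything. A minimal sketch using a naive DFT on a synthetic tone; the frame size and tone are assumed toy values, and this is not Gemma's input path (which, per the comment above, takes raw audio).

```python
import cmath
import math

def dft_magnitudes(samples: list[float]) -> list[float]:
    """Naive O(n^2) discrete Fourier transform; returns magnitude per frequency bin."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]

# Assumed toy input: 64 samples of a pure tone completing 8 cycles per frame.
frame = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
mags = dft_magnitudes(frame)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
print(peak_bin)  # the energy lands in bin 8
```

Real pipelines would use an FFT over overlapping windows (a spectrogram) rather than this quadratic loop.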
I don't understand why Google does this. If I can run this locally, why would I need a subscription or use any inference provider, including Google..?

Scorched earth tactics to make anthropic and openai IPO fail?

Last time I tried Gemma 4 (26B-A4B) its memory usage would balloon and consume all of my swap until my machine died.

Qwen 3.6 on the other hand barely uses any memory at all for its KV cache.

Turns out when you block people from the best and biggest hardware, they get innovative. It reminds me of the Pentium days when everyone was shipping inefficient programs because the processor would be better next year.
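For a sense of why a KV cache can eat all available memory, here is the usual back-of-the-envelope formula for a model using full attention in every layer. The layer count, head count, and head size below are made-up illustrative values, not the real configuration of Gemma 4 or Qwen 3.6.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Factor of 2: one tensor for keys and one for values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

GIB = 1024 ** 3
# Hypothetical config: 48 layers, 8 KV heads, head_dim 128, 16-bit cache entries.
full = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, seq_len=131072)
short = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{full / GIB:.1f} GiB at 128K context, {short / GIB:.1f} GiB at 8K")
```

The cache grows linearly with context length, so capping the context is the first thing to try; architectures that use sliding-window or other reduced-attention layers would need less than this estimate.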
What quantisation do the creators intend this to be run at? They talk about 16GB of RAM, so should it be run at 8-bit? People here are talking about using Q4, but I would have thought a smaller model like this wouldn't perform well at such low bits per parameter. Edit: it looks like their benchmarks would have been done at 16-bit float, as the Hugging Face release is that size: https://huggingface.co/google/gemma-4-12B . Which is a little deceptive: they're advertising that an 8-bit size will fit on 16GB laptops, while releasing a 16-bit size.

I guess we have to wait for someone to produce perplexity curves at different Q's.
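The size arithmetic behind this can be sketched as follows, assuming weight storage is simply parameters times bits per weight. This ignores KV cache, activations, and runtime overhead, so actual memory use will be higher.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return params_billions * bits_per_weight / 8

for bits in (16, 8, 4):
    print(f"12B at {bits}-bit: ~{weight_gb(12, bits):.0f} GB")
```

That gives roughly 24 GB, 12 GB, and 6 GB, which is consistent with 16-bit not fitting in 16GB while 8-bit does. Real quant formats usually store somewhat more than their nominal bits per weight, so treat these as lower bounds.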

They haven't made one for this new model, but Unsloth has a comprehensive quant KLD map of Gemma 4 26B A4B here: https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-p...
I was excited about this until I fed it one of my local test problems: coin identification. I then spent 10 minutes arguing with it that a photo of a 1998 washington quarter was not, in fact, a Morgan Silver Dollar. I mean, I wish it was.

It went into a crash loop on a british columbia 1 dollar coin. This happened with both Q4_1 and Q8. Maybe I'm holding it wrong or it's just really bad for this task.

In contrast, gemma4 gets the british columbia coin right though it also mis-identifies the quarter. gemini 3.1-flash-lite nails them both.

Was getting about 50 t/s output on a 3090 with Q8 which seems ok.

Why would you expect it to be good for this particular highly specific task? Curious.
Quickly deployed it to check some benchmarks relevant for the German language. These are the results for CohereLabs/include-base-44, German only (Gemma 4 12B: 61.9%):

  Gemma 4 26B (a4b MoE)    0.647
  Qwen 3 14B               0.621 
  Gemma 4 12B              0.618
  Ministral 14B 2512       0.604 
  Gemma 3 12B              0.547
The Qwen 3 14B vs Gemma 4 12B difference is within random variance; in some repeat runs they actually got the exact same score. The next step up, Gemma 4 31B, gets 0.676 on this. Or, if you allow some reasoning, Qwen 3 14B (reasoning) also gets 0.676.

I'll run some cheat-proof benchmarks tomorrow to see if Qwen is still on top.

I just ran a short tool use test and it's doing pretty well.
Am I missing something or are the Ollama versions of this (https://ollama.com/library/gemma4/tags) text-only for now?
Since ollama has diverged from llama.cpp, it will take a bit of time for ollama to support multi-modality. If you're using plain llama.cpp it looks like a PR has already merged for this model with vision and audio support:

https://github.com/ggml-org/llama.cpp/pull/24077

Just use llama.cpp or Unsloth Studio, which wraps it. I don't know why anyone uses Ollama anymore.
To anybody else wondering: Seems like the models supporting image input are just starting to show up. https://ollama.com/library/gemma4:12b-mlx now shows as supporting it, but curiously the overview on https://ollama.com/library/gemma4/tags still lists it as text only. Cache invalidation remains difficult :)
Stop using ollama
Ollama is a shitty project that steals from the open source community, don't use it, use llama.cpp instead.
It’s fascinating to see how much small language models have grown in capability recently while staying small enough for consumers to run on their own machines.
Is this Mac only? Or is that an Ollama issue that it only supports this release of models on Mac? It seems like every tag with the MLX badge is only supported on Mac[0], and that includes all of the tags in this release.

[0] https://ollama.com/library/gemma4/tags

Edit: MLX being Mac-only is independent of the model being MLX (and therefore Mac) only. The latter is what I am asking about.

MLX is quite literally macOS-specific technology, for other platforms you want non-MLX.

I was sure "MLX" stood for "Metal-something-something" but can't find any reference to that somehow, anywho, "Metal" is hardware-accelerated graphics on Apple platforms FWIW.

Edit: about the actual release on Ollama, if you're on non-Apple hardware you probably want the NVFP4 variant ("gemma4:12b-nvfp4") which was uploaded 45 minutes ago, especially if you're with a recent nvidia GPU.

MLX is Apple’s own machine learning framework, designed for Apple Silicon: https://opensource.apple.com/projects/mlx/
The non-MLX versions just dropped on Ollama. gemma4:12b-it-q8_0, gemma4:12b-it-bf16, etc.
There's a CUDA backend for MLX now. Not sure about the maturity.
I run gemma-4-26b-bf16 in mtp mode and it runs very smooth, spitting out answers in seconds and outputting text 30x faster than i can read.
The optimal small-model solution, delivering multimodal, reasoning, and coding experiences on affordable hardware that were remarkably close to those of mid-to-large models at the time.
Unfortunately there's no gguf quants of the assistant model yet: https://huggingface.co/models?other=base_model:quantized:goo...
I think MTP Gemma4 support is still WIP https://github.com/ggml-org/llama.cpp/pull/23398 ?
It seems worse in all aspects than the 26B A4B? I would have thought dense models still beat MoE on many benchmarks?

Is the entire point of this model then that it runs if you don’t have enough GPU memory to load the 26B? That one runs faster anyway due to lower active params.

"Small enough to run locally with just 16GB of VRAM or unified memory"

With many laptops dropping back down to 8GB because of the memory shortage there's some interesting pressures building in the industry.

A small dense multimodal model with audio support, interesting.

Wait, *Excluding Chinese language.

This is ... curious.

P.S. Where is gemma 4 124b?

Where are the computers we could purchase to run 124b models :’(
> Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.

I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog)

https://newsletter.maartengrootendorst.com/p/a-visual-guide-... (in a link from here: https://developers.googleblog.com/gemma-4-12b-the-developer-..., which was linked in the text of the post, but not the linkdump at the end).
My understanding is that early (and most extant) visual language models have a component module (called the image encoder) that transforms images into representations (called embeddings) the model's inner layers can process.

This is often a separate module grafted onto the main model, and further pre-trained (e.g. OpenAI's CLIP, SigLIP used in the Gemma 3 and PaliGemma series).

The image encoder approach has a few problems.

One problem is that many encoders, like Gemma 3's, have fixed image resolution constraints, so inputs must be resized, with all the attendant distortions that causes for spatial understanding. However, the Gemma 4 series image encoders overcame this and can handle variable-dimension inputs.

Two, these image encoders are somewhat large (ranging from 300-500M parameters) requiring extra memory and FLOPs to run.

Three, say we need to fine-tune a vision language model: updates to its weights may affect its understanding of the representations generated by the image encoder if we don't fine-tune both together.

The new Gemma-4-12B replaces the encoder (with its many attention layers and large parameter count) with a simple linear projection to generate the embeddings for images. That reduces the computational requirements and simplifies the input pipelines for image processing.

I don't have any expertise on the topic though and might very well be wrong on some details.
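To make the "simple linear projection" idea concrete, here is a toy sketch of a patch embedding: cut the image into tiles, flatten each tile, and apply one linear layer. This illustrates the general technique only; the patch size, embedding width, and random weights are assumptions, not Gemma 4 12B's actual implementation.

```python
import random

def patchify(image: list[list[float]], patch: int) -> list[list[float]]:
    """Split a 2D grayscale image into flattened, non-overlapping patch x patch tiles."""
    h, w = len(image), len(image[0])
    return [
        [image[r + dr][c + dc] for dr in range(patch) for dc in range(patch)]
        for r in range(0, h - patch + 1, patch)
        for c in range(0, w - patch + 1, patch)
    ]

def project(patches, weight, bias):
    """One linear layer: embedding = W @ patch + b, for each patch."""
    return [
        [sum(w_i * x for w_i, x in zip(row, p)) + b for row, b in zip(weight, bias)]
        for p in patches
    ]

random.seed(0)
PATCH, EMBED_DIM = 4, 8  # assumed toy sizes
image = [[random.random() for _ in range(8)] for _ in range(12)]  # 12 rows x 8 columns
weight = [[random.gauss(0, 0.02) for _ in range(PATCH * PATCH)] for _ in range(EMBED_DIM)]
bias = [0.0] * EMBED_DIM

tokens = project(patchify(image, PATCH), weight, bias)
print(len(tokens), len(tokens[0]))  # 6 patches, each an 8-dimensional embedding
```

Because the number of tiles simply follows the image dimensions, a scheme like this handles variable-size inputs without resizing, and the only vision-specific parameters are one weight matrix and a bias.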

Do Gemma 4 models compete with Gemini 3.1 Flash-Lite? I would assume even the smallest Gemini model would outperform even Gemma 4 31B, but I can't really get a sense of performance or output quality difference.
Gemma 4 31B outperformed Gemini 3.1 Flash-Lite in our app benchmarks (agentic tool use via API in our application as a part of various workflows). But Google won't let you pay to use Gemma models, you have to go elsewhere. I think this may be because it would cannibalize Flash-Lite.
Quite a niche release. The MoE outperforms it on score and will likely be faster thanks to lower active weights. So this really only makes sense for specific ram constrained applications that can’t fit a quantized MoE
The un-quantized MoE outperforms it.

But between the 4-bit 26B-A4B and the 8-bit 12B, which have roughly the same (V)RAM requirement, it's unclear which one will win, especially given one is MoE and the other dense.

All the launch benchmarks are at 16 bit.

why combine audio & image analysis into an llm though, why not allow the user to choose their own audio & image analysis alongside their own llm choice?
Is there a paper on this?

I'm curious how they pre-trained it... I feel like it must have had audio/image output that they chopped off.

I wonder how hard it would be to add it back on.

I mean Claude is multimodal on input but not output, why couldn't this also be?
It feels like this would be beneficial to give the model more of a deep understanding of visual knowledge.
"Laptop ready: Small enough to run locally with just 16GB of VRAM or unified memory." I wish. I just have 12.
How does it compare with e4b, aside from being larger?
There's a comparison of all the Gemma 4 models (+ Gemma 3 27B) on the Huggingface model card: https://huggingface.co/google/gemma-4-12B-it#benchmark-resul...
That's what I want to know too. A smarter E4B that's happy in opencode would be a good selfhosted model for me
It's quite interesting to see the quants pour into the HF page. I keep refreshing it and see many new quants every few mins.
Using an embedder instead of an encoder is quite clever. Not sure who came up with that first but it's a cool idea.
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away
Are there qwen or minimax or other open weight models of same hardware requirements that outperform this?
I'm actually wondering how much better this is (besides multimedia) than prismml's 1.5-bit model based on Qwen 2.5 or something.
I want to like the vision capabilities of the model. However, when I gave it an image which Gemma 26B A4B and Qwen 3.6 35B A3B have no problem correctly describing in detail, including identifying the Taj Mahal in the background, it utterly failed. Its sense of the image was that it was a "distorted wide panorama", and even when I asked directly if it was the Taj Mahal it said no. The reference models saw it correctly as a normal square image taken with a fairly rectilinear lens (iPhone main camera).
I have now also tried it on this scatter plot: https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-p...

Similarly, the 26B A4B Gemma 4 and the 35B A3B Qwen 3.6 identify it clearly, give me the title and trends analysis fairly accurately. While this 12B spits out gobbledygook about it having something to do with hard-drive capacity. It's like it can barely see, gets the very broad strokes (knows it's looking at some kind of chart), but can't identify any details clearly.

How does this compare to frontier models?
I don’t see the download in lm studio
It also says it is supposed to be available in their own Edge Gallery app and it’s not there (on iOS).
Just tried this out. Jesus Christ. Google does some things so well.
I mean, they did invent the technology. It's actually kinda surprising they're not the leader in the space. They kinda got Kodaked (though the story is still playing out, and I guess they're still somewhat competitive in the space even if Anthropic and OpenAI are the leaders).
good one, wanna try on Cerebras inference in Agentic Browsing
I'm seeing very low quality results on LMStudio with this model. Worse than Gemma 3 12B.

It is getting questions like "David has 18 apples and Ivan has 7 apples. How many apples do they have together?" wrong half the time, while Gemma3 12B could very consistently answer that. Other smoke tests (like Chinese translation, and the infamous "Rs in Strawberry" test) also show poor results.

I don't know if it is a quantization/release issue, if the parameters needed for accurate responses have changed (i.e. it needs "thinking" tokens to handle its base error rate), or if the model has been so focused on audio/video that the text processing is bad.

Not terribly impressed with this one. I asked it for a recommendation between Paris and Berlin and option 3 was Rome... and option 4 was Tokyo.

mmmkay.

I can’t help but wonder if this is the basis of the model they’ve helped tune for Apple.
Is there some place where we can try it before downloading the gigabytes of weights?
Asked it to name the director who wears a Rolex and likes submarines. It said Christopher Nolan.
I do enjoy the immediate out of touch signaling with the "runs on your 16gb vram laptop" line. Because everyone has a laptop with 16gb vram, or can just pop out and buy a new one, right?
This comment has me a bit confused.

Consumers were complaining about the standard 8GB with the early 2020 refresh of MacBook Pros, many OSes ago. Sure, it might be workable for many tasks (as evidenced by the recent sales of the MacBook Neo), but users with a mere 8GB shouldn't have expectations of LLM performance. Even 16GB feels like a stretch.

They already provide E2B and E4B that run on (much) smaller devices, including tablets and phones. This fills the gap in the middle. The bigger Gemma 4 models are excellent for their size, but at 8-bit quantization they need about 64GB of VRAM or unified memory. 48GB for 6-bit. Any lower quantization than that, they start to get notably dumber. So, a 12B is interesting for that middle ground.
I have 24 gb unified memory so it’s a good model for me
Surely they must know the current hurdles, but clearly they know that all the relevant people are monitoring the market for the proper hardware to get and 16GB will be an entry point.