Story Detail of id 48391980 | Liveview Hacker News

dgacmu 21 hours ago | on: Gemma 4 12B: A unified, encoder-free multimodal model

I was excited about this until I fed it one of my local test problems: coin identification. I then spent 10 minutes arguing with it that a photo of a 1998 Washington quarter was not, in fact, a Morgan Silver Dollar. I mean, I wish it was.

It went into a crash loop on a British Columbia 1 dollar coin. This happened with both Q4_1 and Q8. Maybe I'm holding it wrong, or it's just really bad at this task.

In contrast, gemma4 gets the British Columbia coin right, though it also misidentifies the quarter. gemini 3.1-flash-lite nails them both.

I was getting about 50 t/s of output on a 3090 with Q8, which seems OK.

sureglymop 21 hours ago | parent

Why would you expect it to be good for this particular highly specific task? Curious.

dgacmu 20 hours ago | root | parent

Ah! Good question: Google's non-open-weights models (Gemini, etc.) have almost always outperformed other models on image recognition tasks. I use a mix of in-house models and Gemini for image classification tasks for $startup. No other models have done as well, and I had hoped that some of that would spill over into their open source models. It does to a degree - bigger Gemma models are okay.
