Story Detail of id 48391488 | Liveview Hacker News

easygenes 22 hours ago | on: Gemma 4 12B: A unified, encoder-free multimodal model

I want to like the vision capabilities of the model. However, when I gave it an image that Gemma 26B A4B and Qwen 3.6 35B A3B have no problem describing correctly in detail, including identifying the Taj Mahal in the background, it utterly failed. Its sense of the image was that it was a "distorted wide panorama", and even when I asked directly whether it was the Taj Mahal, it said no. The reference models correctly saw it as a normal square image taken with a fairly rectilinear lens (iPhone main camera).

easygenes 22 hours ago | parent

I have now also tried it on this scatter plot: https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-p...

Similarly, the 26B A4B Gemma 4 and the 35B A3B Qwen 3.6 identify it clearly and give me the title and trend analysis fairly accurately, while this 12B spits out gobbledygook about it having something to do with hard-drive capacity. It's like it can barely see: it gets the very broad strokes (knows it's looking at some kind of chart) but can't identify any details clearly.
