Story Detail of id 48386647 | Liveview Hacker News

georgehm 1 day ago | on: Gemma 4 12B: A unified, encoder-free multimodal model

Embedded within that developer page is a good explainer of the encoder-free architecture: https://newsletter.maartengrootendorst.com/p/a-visual-guide-...

amelius 13 hours ago | parent | next

I skimmed it, but I still wonder (1) why we still need a tokenizer for text, and (2) why the other modalities (audio/video) don't need one.

sigmoid10 8 hours ago | root | parent

How do you think the other modalities are fed into the attention layers? They are tokenized as well: that's literally what the separate image/audio encoders produced as output before it was fed into the main network. Tokenization is at its core just a tradeoff between sequence length and embedding size, so it will probably stay relevant as long as attention layers scale quadratically with sequence length.
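
To put rough numbers on that tradeoff, here is a minimal sketch using plain patch-based image tokenization. It is an illustration only, not how Gemma or any particular encoder works: the function name `patch_tokenization`, the 224x224 RGB input, and the patch sizes are all assumptions picked for the example. Bigger patches mean fewer tokens but more raw values packed into each one, and the number of attention pairs grows with the square of the token count.

```python
def patch_tokenization(height, width, channels, patch):
    """Return (sequence_length, raw_values_per_token, attention_pairs).

    Hypothetical helper: splits an image into non-overlapping square
    patches and treats each patch as one token.
    """
    if height % patch or width % patch:
        raise ValueError("patch size must divide both image dimensions")
    seq_len = (height // patch) * (width // patch)  # number of tokens
    values_per_token = patch * patch * channels     # raw size of one token
    attention_pairs = seq_len * seq_len             # full self-attention is quadratic
    return seq_len, values_per_token, attention_pairs


if __name__ == "__main__":
    # Assumed example input: a 224x224 RGB image.
    for patch in (1, 8, 16, 32):
        n, d, pairs = patch_tokenization(224, 224, 3, patch)
        print(f"patch={patch:>2}  tokens={n:>6}  values/token={d:>5}  attention pairs={pairs:>13,}")
```

The total amount of raw data (tokens times values per token) stays the same in every row; only how it is split between sequence length and per-token size changes, which is why treating each pixel as its own token gets expensive so quickly.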

asim 1 day ago | parent

That's a great explainer, thanks for sharing it.
