Story Detail of id 48387294 | Liveview Hacker News

mchinen 1 day ago | on: Gemma 4 12B: A unified, encoder-free multimodal model

Ah yeah, thinking about it further, it's probably just using some positional embedding based on each token's index in the sequence, added in the LLM layers. For vision it needs the patch location as well.
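To make the distinction concrete, here's a rough sketch of what I mean. This is not how Gemma actually does it (I don't know that); it assumes plain sinusoidal encodings, and the function names `sinusoidal`, `text_position`, and `patch_position` are made up for illustration. Text only needs a 1D index, while an image patch needs its row and column.

```python
import math

def sinusoidal(pos, dim):
    """Sinusoidal encoding of one integer position into `dim` values (dim must be even)."""
    out = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        out.append(math.sin(pos * freq))
        out.append(math.cos(pos * freq))
    return out

def text_position(seq_index, dim):
    # Text tokens: the position is just the index in the sequence.
    return sinusoidal(seq_index, dim)

def patch_position(row, col, dim):
    # Image patches (assumed scheme): encode row and column separately,
    # half of the dimensions each, so the 2D location is preserved.
    # dim must be divisible by 4 here.
    half = dim // 2
    return sinusoidal(row, half) + sinusoidal(col, half)

if __name__ == "__main__":
    print(text_position(3, 8))
    print(patch_position(1, 2, 8))
```

With only the 1D index, the patches at (0, 1) and (1, 0) can't be told apart unless the image width is known; encoding row and column separately keeps them distinct.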
