Story Detail of id 48390796 | Liveview Hacker News

aesthesia 1 day ago | on: Gemma 4 12B: A unified, encoder-free multimodal model

Audio is one-dimensional, so the usual RoPE position encoding should handle it the same way it does for text tokens. You only need extra position encoding for higher-dimensional inputs like images.
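To make that concrete, here is a minimal NumPy sketch of plain 1D RoPE over a sequence of feature vectors. It is not Gemma's actual code: the function name `rope_1d`, the split-halves channel pairing, and the base of 10000 are assumptions chosen for illustration. The point is that the rotation depends only on the index along a single axis, so a sequence of audio frames can be treated like a sequence of text tokens.

```python
import numpy as np

def rope_1d(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq_len, dim), dim even.

    Hypothetical helper for illustration; pairing and base are assumptions.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # One rotation frequency per pair of channels.
    freqs = base ** (-np.arange(half) / half)
    # Angle for position p and channel pair i is p * freqs[i].
    angles = np.outer(np.arange(seq_len), freqs)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1[i], x2[i]) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

# Same query/key vector repeated at every position (text tokens or audio frames).
rng = np.random.default_rng(0)
q_vec, k_vec = rng.normal(size=8), rng.normal(size=8)
q = np.tile(q_vec, (6, 1))
k = np.tile(k_vec, (6, 1))

# scores[m, n] is the attention logit between position m and position n.
scores = rope_1d(q) @ rope_1d(k).T
```

With identical content at every position, `scores[m, n]` depends only on the offset `n - m`, which is the relative-position property that carries over to any 1D sequence. Images need something extra only because a single index cannot express both row and column offsets.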
