Story Detail of id 48391855 | Liveview Hacker News

pseudollm23 hours ago | on: Gemma 4 12B: A unified, encoder-free multimodal model

No, there isn't. Read the paper. It's just 40 ms of raw audio samples, multiplied by a single matrix to project them into a 3800-dimensional input vector. That's it. The next 40 ms are fed in at the next transformer input step, without any positional encoding. Repeat ad infinitum.
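To make that concrete, here is a rough sketch of the pipeline as I described it, not the paper's actual code. The 16 kHz sample rate is my assumption (it gives 640 samples per 40 ms frame), the weights are random stand-ins for the learned matrix, and the names `frame_audio`, `project`, and `embed_stream` are made up for illustration. It is plain Python so it runs without dependencies.

```python
import random

SAMPLE_RATE = 16_000  # ASSUMPTION: the comment does not state a sample rate
FRAME_MS = 40         # from the comment
EMBED_DIM = 3800      # from the comment
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000  # 640 samples under the assumed rate


def frame_audio(samples, frame_len=FRAME_LEN):
    """Split samples into consecutive non-overlapping frames, dropping any tail."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def project(frame, weight_columns):
    """One matrix multiply: frame (length F) times weight (F x D), stored by column."""
    return [sum(x * w for x, w in zip(frame, col)) for col in weight_columns]


def embed_stream(samples, weight_columns, frame_len=FRAME_LEN):
    """Yield one input vector per frame, in order; no positional encoding is added."""
    for frame in frame_audio(samples, frame_len):
        yield project(frame, weight_columns)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical random weights standing in for the learned projection matrix.
    weight = [[rng.uniform(-0.02, 0.02) for _ in range(FRAME_LEN)]
              for _ in range(EMBED_DIM)]
    audio = [rng.uniform(-1.0, 1.0) for _ in range(2 * FRAME_LEN)]  # 80 ms of fake audio
    vectors = list(embed_stream(audio, weight))
    print(len(vectors), len(vectors[0]))  # 2 3800
```

Each yielded vector would be one transformer input step; nothing here looks at neighbouring frames or adds a position signal.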
