That said, my personal hypothesis is that AGI will emerge from video generation models rather than text generation models. A model that takes an arbitrary real-time video input feed and must predict the next, say, 60 seconds of video would have to have a deep understanding of the universe, humanity, language, culture, physics, humor, laughter, problem solving, etc. This pushes the fidelity of both input and output far beyond anything that can be expressed in text, but also creates extraordinarily high computational barriers.
And what I'm saying is that I find that argument to be incredibly weak. I've seen it time and time again, and honestly at this point it just feels like a "humans should be a hundred feet tall based on their rate of growth in their early years" argument.
While I've also been amazed at the past progress in LLMs, I don't see any reason to expect that rate to continue in the future. What I do see, the more I use the SOTA models, are fundamental limitations in what LLMs are capable of.