Because it doesn't work the way you think at all. You're still thinking it works like Chain of Thought. It doesn't. And the difference is key!

It works by introducing probabilistic noise, and exploring N paths fully (each with noise) in parallel (all compressed).
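To make "N noisy paths" concrete, here's a toy sketch. It is not GRAM's actual mechanism. `cheap_score` and `target` are hypothetical stand-ins for whatever small learned scorer a real system would use, and the "latent" is just a list of three floats.

```python
import random

def perturb(latent, sigma, rng):
    """Return a copy of `latent` with independent Gaussian noise on each coordinate."""
    return [x + rng.gauss(0.0, sigma) for x in latent]

def cheap_score(latent, target):
    """Hypothetical cheap scorer: negative squared distance to a made-up target."""
    return -sum((a - b) ** 2 for a, b in zip(latent, target))

def best_of_n(latent, target, n=8, sigma=0.5, seed=0):
    """Score the original latent plus n noisy copies; keep the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [latent] + [perturb(latent, sigma, rng) for _ in range(n)]
    return max(candidates, key=lambda c: cheap_score(c, target))

start = [0.0, 0.0, 0.0]
target = [1.0, -1.0, 0.5]   # assumption: pretend we know what a "good" latent looks like
chosen = best_of_n(start, target)
```

The only point is that all the candidates get scored without a pass through the big model; just `chosen` would move on to the expensive step.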

It's reasoning at a much, much smaller (probabilistic) level, instead of running everything through the expensive large model (deterministic) and only sometimes catching that it said, "I think 1.12 is greater than 1.9 because 12 is bigger than 9, final answer".

The easiest way to think about it is: if you understand how hyper words work, it's as if it's searching for different versions of the hyper words that would probabilistically lead to better outcomes IF it fed them to the LLM, before it even does.

That's not exactly how it works. But I think it's close enough to help you understand where the gain is, get a rough idea of what's happening (searching paths), and see how it can potentially give orders-of-magnitude improvements (by not paying the full price of exploring the paths through the huge, expensive model).

And also why it is so much harder to determine what it's "thinking".

If you aren't familiar with hyper words, this is an amazing series: https://youtu.be/eMlx5fFNoYc?si=49KHjn5IrVtyyaFq

The general idea is that a token is a multidimensional vector that represents a word -> think of "man" as [noun, singular, English, masculine, contemporary, ...]. Each time it sees a new word, it mutates this vector into some new token (often never seen before) that still means something. That's how it can roll up a 1M-line context into a shorter context and somehow keep most of the meaning: it mutates all the words into different words that individually mean nothing, but when put next to each other represent the thing you likely want to do, in a way the LLM can somehow make sense of.
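A rough sketch of that "mutation", assuming a stripped-down attention-style average with no learned weights (real models use learned projections and thousands of dimensions). The `EMBED` table and its three axes are made up for illustration.

```python
import math

# Hypothetical hand-made vectors; axes are [nature, finance, water].
EMBED = {
    "bank":  [0.0, 0.5, 0.5],
    "river": [0.1, 0.0, 1.0],
    "money": [0.0, 1.0, 0.0],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contextualize(word, context):
    """Blend `word`'s vector with its neighbours using softmax(dot-product) weights."""
    q = EMBED[word]
    tokens = [word] + context
    scores = [dot(q, EMBED[t]) for t in tokens]
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * EMBED[t][i] for w, t in zip(weights, tokens))
            for i in range(len(q))]

near_river = contextualize("bank", ["river"])
near_money = contextualize("bank", ["money"])
```

Same input word, two different output vectors, and neither one matches any row in the table. That's the "new token that still means something" part.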

Similarly, GRAM operates entirely in a latent space that doesn't mean anything to us, but it's able to predict N different full paths WITHOUT actually exploring them fully through the LLM before it sends the one it "thinks" is best to the LLM.

If you understand how hyper words work, you can understand the noise injection... It's like it's saying: if instead of "The quick round fox" the user had said "The quick brown fox" -> I could probably give a response that's more like the answer they want. It's obviously far more sophisticated in the ways it can help than just fixing a simple typo.

Something may have pushed a hyper word for "man" to somehow become a lot more like "woman", and GRAM allows it to look at the different hyper words and say... Hmm... Maybe if I changed this one gender dimension over here on this one word, maybe the entire outcome would be dramatically better. Let's try it!
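The "change one dimension and see" idea can be sketched like this. It's a toy, with a hypothetical `score` function and made-up vectors, not anything taken from GRAM itself.

```python
def score(vec, target):
    """Hypothetical scorer: negative squared distance to a made-up 'wanted' vector."""
    return -sum((a - b) ** 2 for a, b in zip(vec, target))

def best_single_flip(vec, target):
    """Negate each coordinate on its own; return (index, vector) of the best result.
    The index is None if no single flip beats the original."""
    best_idx, best_vec = None, vec
    for i in range(len(vec)):
        trial = vec[:i] + [-vec[i]] + vec[i + 1:]
        if score(trial, target) > score(best_vec, target):
            best_idx, best_vec = i, trial
    return best_idx, best_vec

drifted = [0.9, -0.8, 0.3]   # assumption: dimension 1 drifted to the wrong sign
wanted = [1.0, 0.8, 0.3]
idx, fixed = best_single_flip(drifted, wanted)
```

Here the search picks dimension 1, the one that had drifted, and leaves the other two alone.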

Standard models compute these "hyper words" internally but immediately decode them into human language text tokens to form a Chain of Thought. Once decoded into a rigid real word, the multidimensional nuance of the continuous vector is lost!
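Here's a small illustration of what gets lost in decoding, assuming a nearest-neighbour lookup over a made-up 2-d vocabulary. Real models pick tokens from logits rather than by distance, so treat this as a simplification.

```python
import math

# Hypothetical 2-d token embeddings.
VOCAB = {
    "man":   [1.0, 0.0],
    "woman": [0.0, 1.0],
    "child": [0.5, 0.5],
}

def nearest_token(vec):
    """Snap a continuous vector to the closest vocabulary entry."""
    return min(VOCAB, key=lambda t: math.dist(vec, VOCAB[t]))

hidden = [0.9, 0.3]                      # a continuous state between the vocabulary points
token = nearest_token(hidden)            # the discrete word it gets rounded to
lost = math.dist(hidden, VOCAB[token])   # how far the rounding moved it
```

Once `hidden` is replaced by `token`, everything that distinguished it from the plain vocabulary vector (the `lost` distance) is gone from whatever reads the text next.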

Hyper words are the exact thing that makes LLMs able to actually be smart! They can add so much more meaning to a word than a human ever could imagine - try to put 10,000 dimensions on the word "the"... Forcing them to decode them back into our dumb, un-contextualized, rudimentary language and lose all the valuable information they have - just so we can inspect it - OBVIOUSLY makes them enormously less intelligent!

It's like if we forced your eyeballs to turn everything they saw into words before feeding it to your optic nerves, just so your optic nerves could check that you didn't see something harmful before they sent the words to your brain... Instead of just sending light signals directly.