Chain-of-thought can hurt performance on tasks where thinking makes humans worse
https://arxiv.org/abs/2410.21333

Humans are so smart: we make so many decisions and calculations at the subconscious/implicit level and take so many mental shortcuts that, when we try to automate this by following the process exactly, we bring a lot of that implicit thinking out onto the surface, and that slows everything down. So we've had to be creative about how we build LLM workflows.
We've observed this previously in psychiatry (and modern journalism, but here I digress), but LLMs have made it obvious that grammatically correct, naturally flowing language requires a "world" model of the language and close to nothing of reality. Spatial understanding? Social cues? Common-sense logic? Mathematical logic? All optional.
I'd suggest we call the LLM's language fundament a "Word Model" (not a typo).
Trying to distil a world model out of the word model. A suitable starting point for a modern remake of Plato's cave.
It's not that these "human tools" for understanding "reality" are superfluous; it's just that they are second-order concepts. Spatial understanding, social cues, math, etc. are all constructs built WITHIN our primary linguistic ideological framing of reality.
To us these are totally different tasks and would actually require totally different kinds of programmers, but when one language is just another language and everything is language, the inventions we made to expand the human brain's ability to delve into linguistic reality are of no use.
And the random noise in the process could prevent it from ever being useful, or it could allow it to find a hyper-efficient, clever way to apply cross-language transfer learning that yields a 1:1 mapping of your perfectly descriptive prompt to equivalent ASM... but just this one time.
There is no way to know where performance per parameter plateaus; or appears to on a projection, or actually does... or will, or deceitfully appears to... to our mocking dismay.
As we are currently hoping to throw more compute at it (we already fed it all the data), I sure hope it is not that last one.
Thus, Large Word Model (LWM) would be more precise, following his argument.
> In extensive experiments across all three settings, we find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance (e.g., up to 36.3% absolute accuracy for OpenAI o1-preview compared to GPT-4o) when using inference-time reasoning compared to zero-shot counterparts.
In other words, the issue they're identifying is that CoT is a less effective approach for some tasks than unmodified chat completion, not just that it slows everything down.
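For concreteness, the comparison is roughly between the two prompting styles below. This is a minimal sketch using the OpenAI Python SDK (v1+); the model name, the placeholder task, and the exact prompt wording are my own illustrative choices, not the paper's experimental protocol.

```python
# Sketch: zero-shot answer vs. explicit chain-of-thought on the same task.
# Assumes the OpenAI Python SDK (v1+); model and prompts are illustrative only.
from openai import OpenAI

client = OpenAI()
task = "Classify this item into one of the given artificial categories: ..."  # placeholder

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Zero-shot: answer directly, no visible reasoning.
zero_shot_answer = ask(task + "\nAnswer with the category name only.")

# Chain-of-thought: ask the model to verbalize its reasoning before answering.
cot_answer = ask(task + "\nThink step by step, then give the category name.")

print("zero-shot:", zero_shot_answer)
print("CoT:      ", cot_answer)
```

The paper's finding, loosely stated, is that on certain tasks the second style scores measurably worse than the first, not merely slower.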
For the purpose of AGI, LLMs are starting to look like a local maximum.
I've been saying it since they started popping off last year and everyone was getting euphoric about them. I'm basically a layman - a pretty good programmer and software engineer, and took a statistics and AI class 13 years ago in university. That said, it just seems so extremely obvious to me that these things are likely not the way to AGI. They're not reasoning systems. They don't work with axioms. They don't model reality. They don't really do anything. They just generate stochastic output from the probabilities of symbols appearing in a particular order in a given corpus.
It continues to astound me how much money is being dumped into these things.
That's how we think. We think sequentially. As I'm writing this, I'm deciding the next few words to type based on my last few.
Blows my mind that people don't see the parallels to human thought. Our thoughts don't arrive fully formed as a god-given answer. We're constantly deciding the next thing to think, the next word to say, the next thing to focus on. Yes, it's statistical. Yes, it's based on our existing neural weights. Why are you so much more dismissive of that when it's in silicon?
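The "deciding the next word based on the last few" picture is easy to make concrete. Here is a toy autoregressive sampler where a hypothetical bigram table stands in for a trained model; real LLMs condition on far longer contexts and far larger vocabularies, but the generation loop has the same shape.

```python
# Toy autoregressive sampler: draw the next token from a probability
# distribution conditioned on the previous token, then repeat.
# The bigram table is a made-up stand-in for a trained language model.
import random

next_token_probs = {
    "the": {"cat": 0.5, "dog": 0.4, "<end>": 0.1},
    "cat": {"sat": 0.6, "ran": 0.3, "<end>": 0.1},
    "dog": {"ran": 0.7, "sat": 0.2, "<end>": 0.1},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def generate(start: str = "the", max_len: int = 10) -> list[str]:
    tokens = [start]
    while len(tokens) < max_len:
        dist = next_token_probs[tokens[-1]]
        # Stochastic choice weighted by the model's probabilities.
        nxt = random.choices(list(dist), weights=list(dist.values()))[0]
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens

print(" ".join(generate()))
```

Whether that loop is a fair model of human cognition is exactly the disagreement in this thread; the code only shows what "stochastic next-symbol output" mechanically means.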
Remember the resounding euphoria at the LK-99 paper last year, and how everyone suddenly became an expert on superconductors? It's clear that we've collectively learned nothing from that fiasco.
The idea of progress itself has turned into a religious cult, and what's worse, "progress" here is defined to mean "whatever we read about in 1950s science fiction".
Maybe in our society there's a surprising amount of value in a "word stirrer" intelligence. Sure, if it were confident when it was right and hesitant when it was wrong, it'd be much better. Maybe humans are confidently wrong often enough that an artificial version with compendious experience to draw on is groundbreaking.
Human brains are certainly big, but they are inefficient because a large portion of the brain goes to non-intelligence work like running the body's internal organs, processing vision, etc.
I do agree that the money is not well spent. They should have recognized that we are hitting a local maximum with the current models, and funding should be going to academic/theoretical work instead of dumb brute force.
Arguably this is a second regression, the first being cost, because CoT improves performance by scaling up the amount of compute used at inference time instead of training time. The promise of LLMs was that you do expensive training once and then run the model cheaply forever, but now we're talking about expensive training followed by expensive inference every time you run the model.
A regression that humans also face, and we don't therefore say that it is impossible to improve human performance by having them think longer or work together in groups; we say that there are pitfalls. This is a paper saying that LLMs don't exhibit superhuman performance.
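On the cost point above, a back-of-the-envelope sketch with entirely made-up numbers, just to show the shape of the amortization argument: the one-time training cost shrinks per query as volume grows, while the extra CoT tokens are paid on every single query and never amortize.

```python
# Hypothetical figures purely to illustrate the amortization argument;
# none of these numbers come from the paper or any real deployment.
training_cost = 100_000_000          # one-time training cost ($), assumed
queries = 10_000_000_000             # lifetime queries served, assumed
base_inference_cost = 0.001          # $ per zero-shot query, assumed
cot_token_multiplier = 10            # assume CoT emits ~10x the output tokens

amortized_training = training_cost / queries
zero_shot_per_query = amortized_training + base_inference_cost
cot_per_query = amortized_training + base_inference_cost * cot_token_multiplier

print(f"per-query, zero-shot: ${zero_shot_per_query:.4f}")
print(f"per-query, CoT:       ${cot_per_query:.4f}")
# Training amortizes toward zero; the CoT multiplier is a recurring cost.
```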