
Chain-of-thought can hurt performance on tasks where thinking makes humans worse

https://arxiv.org/abs/2410.21333
This is so uncannily close to the problems we're encountering at Pioneer, trying to make human+LLM workflows in high stakes / high complexity situations.

Humans are so smart: we make so many decisions and calculations at the subconscious/implicit level, and take so many mental shortcuts, that when we try to automate a process by following it exactly as written, we drag a lot of that implicit thinking out onto the surface, and that slows everything down. So we've had to be creative about how we build LLM workflows.

Language seems to be confused with logic or common sense.

We've observed it previously in psychiatry (and modern journalism, but here I digress), but LLMs have made it obvious that grammatically correct, naturally flowing language requires a "world" model of the language and close to nothing of reality. Spatial understanding? Social cues? Common-sense logic? Mathematical logic? All optional.

I'd suggest we call the LLM language fundament a "Word Model" (not a typo).

Trying to distil a world model out of the word model. A suitable starting point for a modern remake of Plato's cave.

Language is the tool we use to codify a heuristic understanding of reality. The world we interact with daily is not the physical one, but an ideological one constructed out of human ideas from human minds. This is the world we live in, and the air we breathe is made partly of our ideas about oxygenation and partly of our concept of being alive.

It's not that these "human tools" for understanding "reality" are superfluous, it's just that they are second-order concepts. Spatial understanding, social cues, math, etc.: those are all constructs built WITHIN our primary linguistic ideological framing of reality.

To put this in coding terms: why would an LLM use Rails to make a project when it could just as quickly produce a project writing directly to the socket?

To us these are totally different tasks that would actually require totally different kinds of programmers, but when one language is as good as another and language is everything, the inventions we made to expand the human brain's ability to delve into linguistic reality are of no use.

I can suggest one reason why an LLM might prefer writing in a higher-level language like Ruby over assembly. The reason is the same as why physicists and mathematicians like to work with complex numbers using "i" instead of explicit calculation over 4 real numbers. Using "i" allows us to abstract out and forget the trivial details. "i" allows us to compress ideas better. Compression allows for better prediction.
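
As a tiny illustration of that compression (the example is mine, using Python's built-in complex type as the stand-in for "i"):

    # With "i": one expression, the bookkeeping is hidden.
    z = (3 + 4j) * (1 - 2j)     # -> (11-2j)

    # Without "i": carry the real and imaginary parts by hand.
    a, b = 3, 4                 # 3 + 4i
    c, d = 1, -2                # 1 - 2i
    real = a * c - b * d        # 3*1 - 4*(-2) = 11
    imag = a * d + b * c        # 3*(-2) + 4*1 = -2

Same result, but the first form lets you stop thinking about the individual real numbers entirely.
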
Except LLMs are trained on higher-level languages. Good luck getting your LLM to write your app entirely in assembly. There just isn't enough training data.
But in theory, with what training data there IS available on how to write in assembly, combined with the data available on what's required to build an app, shouldn't a REAL AI be able to synthesize the knowledge necessary to write a webapp in assembly? To me, this is the basis for why people criticize LLMs: if something isn't in the data set, it's just not conceivable by the LLM.
Yes. There is just no way of knowing how many more watts of energy it may need to reach that level of abstraction and depth - maybe one more watt, maybe never.

And the random noise in the process could prevent it from ever being useful, or it could allow it to find a hyper-efficient, clever way to apply cross-language transfer learning that gives a 1->1 mapping of your perfectly descriptive prompt to equivalent ASM... but just this one time.

There is no way to know where performance per parameter plateaus; or appears to on a projection, or actually does... or will, or deceitfully appears to... to our mocking dismay.

As we are currently hoping to throw power at it (we fed it all the data), I sure hope it is not the last one.

There isn't that much training data on reverse engineering Python bytecode, but in my experiments ChatGPT can reconstruct a (unique) Python function's source code from its bytecode with high accuracy. I think it's simulating the language in the way you're describing.
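
If anyone wants to try the same kind of experiment, here's a minimal sketch (the toy function and the use of the dis module are my own choices, not necessarily the setup above): disassemble a function and paste the opcode listing to the model.

    import dis

    def clamp(x, lo, hi):
        # Toy function; its disassembly is what you'd hand to the model.
        return max(lo, min(x, hi))

    dis.dis(clamp)   # prints the human-readable opcode listing for clamp

Then ask the model to reconstruct the source from that listing alone and compare it against the original.
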
I don’t buy this. My child communicates with me using emotion and other cues because she can’t speak yet. I don’t know much about early humans or other sapiens but I imagine they communicated long before complex language evolved. These other means of communication are not second order, they are first order.
It’s in the name: Language Model, nothing else.
I think the previous commenter chose "word" instead of "language" to highlight that a grammatically correct, naturally flowing chain of words is not the same as a language.

Thus, Large Word Model (LWM) would be more precise, following his argument.

Bingo, great reply! This is what I've been trying to explain to my wife. LLMs use fancy math and our language examples to reproduce our language, but have no thoughts or feelings.
Yes, but the initial training sets did have thoughts and feelings behind them, and those are reflected back to the user in the output (with errors).
This is a regression in the model's accuracy at certain tasks when using CoT, not its speed:

> In extensive experiments across all three settings, we find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance (e.g., up to 36.3% absolute accuracy for OpenAI o1-preview compared to GPT-4o) when using inference-time reasoning compared to zero-shot counterparts.

In other words, the issue they're identifying is that CoT is a less effective strategy for some tasks than unmodified chat completion, not just that it slows everything down.
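
Concretely, the two conditions being compared look roughly like this (prompt wording is illustrative, not taken from the paper):

    task = "..."  # one of the paper's classification/recognition questions (placeholder)

    zero_shot = task + "\nGive only the final answer."

    chain_of_thought = task + "\nThink step by step, then give the final answer."

    # The finding: on some tasks, the second prompt *lowers* accuracy.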

why are Pioneer doing anything with LLMs? you make AV equipment
So, LLMs face a regression on their latest proposed improvement. It's not surprising considering their functional requirements are:

1) Everything

For the purpose of AGI, LLMs are starting to look like a local maximum.

> For the purpose of AGI, LLMs are starting to look like a local maximum.

I've been saying it since they started popping off last year and everyone was getting euphoric about them. I'm basically a layman - a pretty good programmer and software engineer, and took a statistics and AI class 13 years ago in university. That said, it just seems so extremely obvious to me that these things are likely not the way to AGI. They're not reasoning systems. They don't work with axioms. They don't model reality. They don't really do anything. They just generate stochastic output from the probabilities of symbols appearing in a particular order in a given corpus.

It continues to astound me how much money is being dumped into these things.

How do you know that they don’t do these things? Seems hard to say for sure since it’s hard to explain in human terms what a neural network is doing.
If you expect "the right way" to be something _other_ than a system which can generate a reasonable "state + 1" from a "state" - then what exactly do you imagine that entails?

That's how we think. We think sequentially. As I'm writing this, I'm deciding the next few words to type based on my last few.

Blows my mind that people don't see the parallels to human thought. Our thoughts don't arrive fully formed as a god-given answer. We're constantly deciding the next thing to think, the next word to say, the next thing to focus on. Yes, it's statistical. Yes, it's based on our existing neural weights. Why are you so much more dismissive of that when it's in silicon?
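
For concreteness, here's the "state -> state + 1" loop in miniature. This is a hand-wavy sketch of autoregressive sampling, not any particular model's implementation; model is assumed to be a callable that returns a probability for each candidate next token:

    import random

    def generate(prompt_tokens, model, max_new=50):
        state = list(prompt_tokens)
        for _ in range(max_new):
            dist = model(state)                             # "state" -> distribution over the next token
            tokens, probs = zip(*dist.items())
            nxt = random.choices(tokens, weights=probs)[0]  # sample "state + 1"
            state.append(nxt)                               # new state = old state + one token
        return state

The only operation is "given everything so far, pick something plausible to come next", which is the parallel being drawn to how we produce words.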

I totally agree that they’re a local maximum and they don’t seem like a path to AGI. But they’re definitely kinda reasoning systems, in the sense that they can somewhat reason about things. The whacky process they use to get there doesn’t take away from that IMO
> I've been saying it since they started popping off last year and everyone was getting euphoric about them.

Remember the resounding euphoria at the LK-99 paper last year, and how everyone suddenly became an expert on superconductors? It's clear that we've collectively learned nothing from that fiasco.

The idea of progress itself has turned into a religious cult, and what's worse, "progress" here is defined to mean "whatever we read about in 1950s science fiction".

> It continues to astound me how much money is being dumped into these things.

Maybe in our society there's a surprising amount of value in a "word stirrer" intelligence. Sure, if it were confident when it was right and hesitant when it was wrong, it'd be much better. Maybe humans are confidently wrong often enough that an artificial version with compendious experience to draw on is groundbreaking.

I am pretty sure Claude 3.5 Sonnet can reason, or did reason, with a particular snippet of code I was working on. I am not an expert in this area, but my guess is that these neural nets (made for language prediction) are being used for reasoning. But that's not their optimal behavior (since they are token predictors). A big jump in reasoning will happen when reasoning is offloaded to an LRM.

Human brains are sure big, but they are inefficient because a big portion of the brain goes to non-intelligence stuff like running the body's internal organs, vision, etc.

I do agree that the money is not well spent. They should have recognized that we are hitting a local maximum with the current models, and that funding should be going to academic/theoretical work instead of dumb brute force.

> So, LLMs face a regression on their latest proposed improvement.

Arguably a second regression, the first being cost, because CoT improves performance by scaling up the amount of compute used at inference time instead of training time. The promise of LLMs was that you do expensive training once and then run the model cheaply forever, but now we're talking about expensive training followed by expensive inference every time you run the model.
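
Back-of-the-envelope version of that trade-off (all numbers are made-up round figures, purely to show the shape of the cost shift):

    answer_tokens = 50        # hypothetical zero-shot reply
    cot_tokens = 500          # hypothetical reply including a reasoning trace
    price_per_1k = 0.01       # hypothetical $ per 1k output tokens

    zero_shot_cost = answer_tokens / 1000 * price_per_1k   # $0.0005 per query
    cot_cost = cot_tokens / 1000 * price_per_1k             # $0.005 per query

    # The ~10x gap is paid on every query, unlike a one-time training cost.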

> So, LLMs face a regression on their latest proposed improvement.

A regression that humans also face, and we don't say therefore that it is impossible to improve human performance by having them think longer or work together in groups; we say that there are pitfalls. This is a paper saying that LLMs don't exhibit superhuman performance.

LLMs are a local maximum in the same way that ball bearings can't fly. LLM-like engines will almost certainly be components of an eventual AGI-level machine.