Chain-of-thought can hurt performance on tasks where thinking makes humans worse
https://arxiv.org/abs/2410.21333

Humans are so smart: we make so many decisions and calculations at the subconscious/implicit level and take so many mental shortcuts that, when we try to automate this by following the process exactly, we bring a lot of that implicit thinking out onto the surface, and that slows everything down. So we've had to be creative about how we build LLM workflows.
We've observed this previously in psychiatry (and modern journalism, but here I digress), but LLMs have made it obvious that grammatically correct, naturally flowing language requires a "world" model of the language and close to nothing of reality. Spatial understanding? Social cues? Common-sense logic? Mathematical logic? All optional.
I'd suggest we call the LLM's language fundament a "Word Model" (not a typo).
Trying to distil a world model out of the word model. A suitable starting point for a modern remake of Plato's cave.
It's not that these "human tools" for understanding "reality" are superfluous; it's just that they are second-order concepts. Spatial understanding, social cues, math, etc. are all constructs built WITHIN our primary linguistic ideological framing of reality.
To us these are totally different tasks and would actually require totally different kinds of programmers, but when one language is just another language and everything is language, the inventions we made to expand the human brain's ability to delve into linguistic reality are of no use.
And the random noise in the process could prevent it from ever being useful, or it could allow it to find a hyper-efficient, clever way to apply cross-language transfer learning that yields a 1:1 mapping of your perfectly descriptive prompt to equivalent ASM... but just this one time.
There is no way to know where performance per parameter plateaus; or appears to on a projection, or actually does... or will, or deceitfully appears to... to our mocking dismay.
As we are currently hoping to throw more compute at it (we already fed it all the data), I sure hope it is not that last one.
Thus, Large Word Model (LWM) would be more precise, following his argument.
> In extensive experiments across all three settings, we find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance (e.g., up to 36.3% absolute accuracy for OpenAI o1-preview compared to GPT-4o) when using inference-time reasoning compared to zero-shot counterparts.
In other words, the issue they're identifying is that CoT is a less effective approach for some tasks than unmodified chat completion, not just that it slows everything down.
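For concreteness, the comparison is roughly between the two prompting styles below. This is a minimal sketch using the OpenAI Python SDK (v1+); the model name, the placeholder task, and the exact prompt wording are my own illustrative choices, not the paper's experimental protocol.

```python
# Sketch: zero-shot answer vs. explicit chain-of-thought on the same task.
# Assumes the OpenAI Python SDK (v1+); model and prompts are illustrative only.
from openai import OpenAI

client = OpenAI()
task = "Classify this item into one of the given artificial categories: ..."  # placeholder

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Zero-shot: answer directly, no visible reasoning.
zero_shot_answer = ask(task + "\nAnswer with the category name only.")

# Chain-of-thought: ask the model to verbalize its reasoning before answering.
cot_answer = ask(task + "\nThink step by step, then give the category name.")

print("zero-shot:", zero_shot_answer)
print("CoT:      ", cot_answer)
```

The paper's finding, loosely stated, is that on certain tasks the second style scores measurably worse than the first, not merely slower.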
For the purpose of AGI, LLMs are starting to look like a local maximum.
I've been saying it since they started popping off last year and everyone was getting euphoric about them. I'm basically a layman - a pretty good programmer and software engineer, and took a statistics and AI class 13 years ago in university. That said, it just seems so extremely obvious to me that these things are likely not the way to AGI. They're not reasoning systems. They don't work with axioms. They don't model reality. They don't really do anything. They just generate stochastic output from the probabilities of symbols appearing in a particular order in a given corpus.
It continues to astound me how much money is being dumped into these things.
That's how we think. We think sequentially. As I'm writing this, I'm deciding the next few words to type based on my last few.
Blows my mind that people don't see the parallels to human thought. Our thoughts don't arrive fully formed as a god-given answer. We're constantly deciding the next thing to think, the next word to say, the next thing to focus on. Yes, it's statistical. Yes, it's based on our existing neural weights. Why are you so much more dismissive of that when it's in silicon?
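The "deciding the next word based on the last few" picture is easy to make concrete. Here is a toy autoregressive sampler where a hypothetical bigram table stands in for a trained model; real LLMs condition on far longer contexts and far larger vocabularies, but the generation loop has the same shape.

```python
# Toy autoregressive sampler: draw the next token from a probability
# distribution conditioned on the previous token, then repeat.
# The bigram table is a made-up stand-in for a trained language model.
import random

next_token_probs = {
    "the": {"cat": 0.5, "dog": 0.4, "<end>": 0.1},
    "cat": {"sat": 0.6, "ran": 0.3, "<end>": 0.1},
    "dog": {"ran": 0.7, "sat": 0.2, "<end>": 0.1},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def generate(start: str = "the", max_len: int = 10) -> list[str]:
    tokens = [start]
    while len(tokens) < max_len:
        dist = next_token_probs[tokens[-1]]
        # Stochastic choice weighted by the model's probabilities.
        nxt = random.choices(list(dist), weights=list(dist.values()))[0]
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens

print(" ".join(generate()))
```

Whether that loop is a fair model of human cognition is exactly the disagreement in this thread; the code only shows what "stochastic next-symbol output" mechanically means.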
Remember the resounding euphoria at the LK-99 paper last year, and how everyone suddenly became an expert on superconductors? It's clear that we've collectively learned nothing from that fiasco.
The idea of progress itself has turned into a religious cult, and what's worse, "progress" here is defined to mean "whatever we read about in 1950s science fiction".
Maybe in our society there's a surprising amount of value in a "word stirrer" intelligence. Sure, if it were confident when it was right and hesitant when it was wrong, it'd be much better. Maybe humans are confidently wrong often enough that an artificial version with compendious experience to draw on is groundbreaking.
Human brains are certainly big, but they are inefficient because a large portion of the brain goes to non-intelligence work like running the body's internal organs, processing vision, etc.
I do agree that the money is not well spent. They should have recognized that we are hitting a local maximum with the current models, and funding should be going to academic/theoretical work instead of dumb brute force.
Arguably this is a second regression, the first being cost, because CoT improves performance by scaling up the amount of compute used at inference time instead of training time. The promise of LLMs was that you do expensive training once and then run the model cheaply forever, but now we're talking about expensive training followed by expensive inference every time you run the model.
A regression that humans also face, and we don't therefore say that it is impossible to improve human performance by having them think longer or work together in groups; we say that there are pitfalls. This is a paper saying that LLMs don't exhibit superhuman performance.
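On the cost point above, a back-of-the-envelope sketch with entirely made-up numbers, just to show the shape of the amortization argument: the one-time training cost shrinks per query as volume grows, while the extra CoT tokens are paid on every single query and never amortize.

```python
# Hypothetical figures purely to illustrate the amortization argument;
# none of these numbers come from the paper or any real deployment.
training_cost = 100_000_000          # one-time training cost ($), assumed
queries = 10_000_000_000             # lifetime queries served, assumed
base_inference_cost = 0.001          # $ per zero-shot query, assumed
cot_token_multiplier = 10            # assume CoT emits ~10x the output tokens

amortized_training = training_cost / queries
zero_shot_per_query = amortized_training + base_inference_cost
cot_per_query = amortized_training + base_inference_cost * cot_token_multiplier

print(f"per-query, zero-shot: ${zero_shot_per_query:.4f}")
print(f"per-query, CoT:       ${cot_per_query:.4f}")
# Training amortizes toward zero; the CoT multiplier is a recurring cost.
```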