> Chain of thought is like trying to improve JPG quality by re-compressing it several times. If it's not there it's not there.
Empirically speaking: I have a prompt and a set of evals with objective pass/fail results. I'm doing codegen, so success is determined by syntax linting, tests passing, etc. With chain-of-thought included in the prompt, the evals pass at a significantly higher rate. A lot of research has demonstrated the same effect in various domains.
If chain-of-thought can't improve quality, how do you explain the empirical results which appear to contradict you?
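For concreteness, here is a minimal sketch of the kind of objective pass/fail check described above. It is not my actual harness: `passes`, `pass_rate`, and `check_add` are hypothetical names, the syntax "lint" is just `ast.parse`, and the candidate strings are made-up placeholders rather than real model output. The idea is to run the same check over the generations from each prompt variant (with and without chain-of-thought) and compare the two rates.

```python
import ast


def passes(source, tests):
    """Return True if `source` parses and every test callable succeeds."""
    try:
        ast.parse(source)  # stand-in for a real linter: syntax check only
    except (SyntaxError, ValueError):
        return False
    namespace = {}
    try:
        # NOTE: exec on untrusted generated code is unsafe; sandbox it in practice.
        exec(source, namespace)
        for test in tests:
            test(namespace)  # each test raises (e.g. AssertionError) on failure
    except Exception:
        return False
    return True


def pass_rate(samples, tests):
    """Fraction of generated samples that pass the lint and all tests."""
    if not samples:
        return 0.0
    return sum(passes(s, tests) for s in samples) / len(samples)


def check_add(namespace):
    """Hypothetical test: the generated code must define a working add()."""
    assert namespace["add"](2, 3) == 5


# Placeholder candidates, NOT real model output.
candidates = [
    "def add(a, b): return a + b",  # correct
    "def add(a, b): return a - b",  # parses, but fails the test
    "def add(a, b) return a + b",   # syntax error
]
print(pass_rate(candidates, [check_add]))
```

Because the result is a plain fraction of passing samples, nothing subjective goes into the comparison between the two prompt variants.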