It doesn't for me. I use Fable to make plans, then give them to GPT 5.5 to review, and it always finds flaws and edge cases that Fable missed (some of them really critical). It was the same with Opus 4.8. I'll admit GPT 5.5 finds somewhat fewer issues now, but Fable feels more like an incremental improvement than a full generation ahead.
For that test, you have to compare against having a fresh agent (subagent) or the same model do the same review.

The fact that a review helps does not prove that the choice of model for the review is what made the difference.

You reviewing your own writing helps too!

This is exactly what I find too. I make plans in both models and have each one compared in the other model. Claude usually agrees (65-80% of the time) that the Codex plan included things it didn't think of, or was better in some other way.

Note that this is better than it was with Opus, where the Codex plans were obviously better more like 90% of the time.