The table comparing eval scores shows the following:

Agentic Terminal Coding (Terminal-Bench 2.1): Opus 4.8 74.6%, GPT 5.5 78.2%

Then, when you scroll all the way down to the Footnotes section at the bottom, it says:

"Terminal-Bench 2.1: We reported scores for all models using the Terminus-2 public harness. GPT-5.5’s reported score with the Codex CLI harness is 83.4%."

Seems reasonable? Presumably Claude also performs better under the Claude Code harness.
Why not state that?