Is there some sort of a leaderboard for this test? Like if you'd give each of Opus 4.8 and GPT 5.5 a score out of 100, what would the scores be?
There isn't, as I wasn't going for strictness, more like a playful challenge in the vein of Simon's SVG pelican.
Between the two, Opus 4.8 seems more capable. But, I suspect the harness plays a large role here. It's possible the result would be as good if Codex ran 10+ agents and spent an hour on it.
OpenAI and Anthropic usually fast-follow each other, so I wouldn't be surprised if Codex got the same capability in a couple of days (and even an update to the model), then it'll be a better test.
Sooo, let's say, winging it, vibes-based: 85% for Opus 4.8, 75% for GPT 5.5. Compare with GPT 5.3 (let's say 25%) here: https://senko.net/vibecode-bench/2026/rts-codex-5.3.html