Folks have run comparisons. From a huggingface employee:
codex + skills finetunes Qwen3-0.6B to +6 on humaneval and beats the base score on the first run.
I reran the experiment from this week, but used codex's new skills integration. Like claude code, codex consumes the full skill into context and doesn't start with failing runs. It's first run beats the base score, and on the second run it beats claude code.
https://xcancel.com/ben_burtenshaw/status/200023306951767675...That said, it's not a perfect comparison because of the Codex model mismatch between runs.
The author seems to be doing a lot of work on skills evaluation.
loading story #46872614
loading story #46872429
loading story #46872371
loading story #46872526