Story Detail of id 46872274 | Liveview Hacker News

postalcoder 10 hours ago | on: #46871907

Folks have run comparisons. From a huggingface employee:

  codex + skills finetunes Qwen3-0.6B to +6 on humaneval and beats the base score on the first run.

  I reran the experiment from this week, but used codex's new skills integration. Like claude code, codex consumes the full skill into context and doesn't start with failing runs. Its first run beats the base score, and on the second run it beats claude code.

https://xcancel.com/ben_burtenshaw/status/200023306951767675...

That said, it's not a perfect comparison because of the Codex model mismatch between runs.

The author seems to be doing a lot of work on skills evaluation.

https://github.com/huggingface/upskill
