That assumes that the agent knows which one is better. And to bake in which one is better via post-training would require a study like this to establish where each one works well
I’ve got a custom ultra high performance streaming semantic search I exposed as a tool and the RL bias in Claude is almost insurmountable without copious and consistent steering. Codex will follow instructions and use the tools I ask it to but for gods sake between Claude asking to take a nap because it’s getting late in the session and it regressing to RL biased tools like grep it’s maddening. When I can get it to use my compositional tools tool calls drop from like 20-50 to 3-4, but it’s almost impossible to steer.