They are not even close in capabilities. The only benchmark I've ever seen that captures their difference is DeepSWE. They are worse by a factor of 3.
Here are 3 benchmarks showing the comparable scores I was talking about:
https://openrouter.ai/rankings https://arena.ai/leaderboard/text/coding https://artificialanalysis.ai/
Wait, the only benchmark you've found? It looks like you've never heard of confirmation bias before. https://en.wikipedia.org/wiki/Confirmation_bias