They are not even close in capabilities. The only benchmark I've ever seen that captures the difference is DeepSWE. They are worse by a factor of 3.
Wait, the only benchmark you found? It looks like you never heard of confirmation bias before. https://en.wikipedia.org/wiki/Confirmation_bias