Story Detail of id 48396456 | Liveview Hacker News

They are not even close in capabilities. The only benchmark I have ever seen that captures their difference is DeepSWE. They are worse by a factor of 3.

Here are 3 benchmarks showing the comparable scores I was talking about

Wait, the only benchmark you found? It looks like you have never heard of confirmation bias before. https://en.wikipedia.org/wiki/Confirmation_bias