Somehow the internet has also forgotten that cheating to get ahead in China is basically the norm and expected behavior.
American labs also use gamed and cherry-picked benchmarks extensively. Anthropic used them in their Fable announcement and avoided DeepSWE because it doesn't beat GPT-5.5 on that one. Google's recent numbers for Gemini 3.5 Flash did not line up at all with people's subjective experience using the model, and the same thing happened with Gemini 3.1 Pro before it.

Everybody has incentives to manipulate benchmark results to show their models in the best light.