And even if we were to agree that that's a reasonable standard, GPT-5 shouldn't be included. It's the only data point for any OpenAI model, so it says more about OpenAI models (and the harness used) than about any progression. Once you exclude it, the data matches what you would expect from a logistic curve: improvements have slowed down, but not stopped.
1: https://metr.org/assets/images/many-swe-bench-passing-prs-wo...
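A quick way to see what "slowed down, but not stopped" looks like on a logistic curve. This is a toy sketch with invented parameters, not a fit to the linked METR data:

    # Toy logistic capability curve; every number here is made up for illustration.
    import math

    def logistic(t, ceiling=1.0, midpoint=5.0, rate=0.8):
        return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))

    prev = logistic(0)
    for t in range(1, 11):
        cur = logistic(t)
        # Past the midpoint the gain shrinks every step but never reaches zero.
        print(f"t={t:2d}  capability={cur:.3f}  gain={cur - prev:+.3f}")
        prev = cur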
If you measure completion rate on a task where a single mistake causes failure, you won't see noticeable improvement on that metric until nearly every potential source of error has been eliminated. Once they are, performance jumps suddenly.
That's fine if you just want to know whether the current state is good enough for your task of choice, but if you also want to predict future performance, you need to break the task down into smaller components and track each of them individually.
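A toy illustration of that threshold effect, with made-up numbers (a hypothetical task of fifty independent steps where any single mistake fails the run):

    # End-to-end completion stays near zero until per-step errors are almost gone,
    # then jumps suddenly, even though the per-step success rate (the component
    # worth tracking) improves smoothly the whole time.
    n_steps = 50
    for per_step_success in (0.90, 0.95, 0.99, 0.995, 0.999):
        completion = per_step_success ** n_steps  # all steps must succeed
        print(f"per-step success {per_step_success:.3f} -> task completion {completion:.1%}")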