That's an interesting claim, but I don't see it in my own work. They have got better but it's very hard to quantify. I just find myself editing their work much less these days (currently using GPT 5.4).
Without meaning to sound dismissive, because I'm really not intending to, there's also the possibility that you've gotten worse after enough time using them. You're treating yourself as a constant in this, but man cannot walk in the same river twice.
loading story #47349820
loading story #47350666
The problem with evals is the underlying rubric will always be either subjective, or a quantitative score based on something that is likely now baked into the training set directly.
You kind of have to go on "feels" for a lot of this.