the last?!? I'm excited to see :) I'll take the other side of that, since LLMs are so new.
Honestly, there is nothing in my head that Claude cannot handle. Maybe it could be a bit more this or that, but I can barely exploit Opus 4.7 as it is.
I'm using DeepSeek 4 Pro for personal use, and while it's a little behind, it's not that far off.
I think the situation could be very dangerous for US AI companies: if current models can already do almost anything, nobody will want to move to the next model, even if it's 10x better. OTOH, open-source models like DeepSeek do mostly the same work for 1/10 of the price.
Also, the more I play with Pi, the more I think LLMs are no longer held back by their own capabilities but by the lack of agency we allow them. There is more value today in a capable harness for current LLMs than in a better LLM.
What have YOU thought of that Claude can't do?
I think what GP said was that the improvements are incremental, that we haven't seen a big revolutionary change in 2-3 years, and that the pace is slowing down.
If benchmarks across the board keep trending up and you still don't notice a difference, that's not evidence the model stopped improving. More likely your tasks aren't hard enough to expose the gains, or the model has passed the point where you're able to judge it.
You can only tell a good answer from a great one up to your own ceiling. Once the model clears that, both look the same to you, and the extra capability is real whether or not you can see it.
If Opus 10 were released tomorrow and were nearly AGI, I would still use it like 4.7, because in daily use I am the limit (along with the harness).
So as a customer paying for tokens, I’m probably going to look for a better price rather than more intelligence.
A friend writes marine autopilots in C++ with 64 KB of memory. It's totally useless for him.
In my experience, all LLMs fail pretty quickly on any sort of more difficult backend logic, especially when you need to work out the business logic yourself (partly, if not mostly, because the model just doesn't have the context you do).
One idea is that maybe it could figure out how many L's are in the word "google" [1]
Or, maybe which days of the week have a "d" in their spelling [2].
So Claude has no excuses here.
Edit: even Qwen 3.6 27B handles it ( https://i.imgur.com/jleJxj2.png ), and of course Claude does. I had to go all the way back to Opus 3 to get it to fail (https://i.imgur.com/uJOH2nP.png).
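For reference, the ground truth for the two letter questions above is easy to check in a few lines of Python. This is just a quick sketch; the helper names `count_letter` and `days_containing` are made up for illustration, and matching is assumed to be case-insensitive.

```python
# Hypothetical helpers for checking the two letter questions; names are illustrative only.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def count_letter(word: str, letter: str) -> int:
    """Count occurrences of `letter` in `word`, ignoring case."""
    return word.lower().count(letter.lower())


def days_containing(letter: str) -> list[str]:
    """Return the English day names that contain `letter`, ignoring case."""
    return [day for day in DAYS if letter.lower() in day.lower()]


if __name__ == "__main__":
    print(count_letter("google", "l"))  # 1
    print(days_containing("d"))         # all seven days, since each ends in "day"
```

The second question is a bit of a trick: every day name ends in "day", so the correct answer is all seven.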