https://github.com/anthropics/claude-code/issues?q=is%3Aissu...
Apparently whatever SWE-bench is measuring isn't very relevant.
I don’t doubt they have found interesting security holes; the question is how they actually found them.
This System Card is just a sales whitepaper, and it confirms what that “leak” from a week or so ago implied.
I suspect it's going to be used to train/distill lighter models. The exciting part for me is the improvement in those lighter models.
Looks like they just built a far larger model with the same quirks as Claude 4. Seems like a super expensive "Claude 4.7" model.
I have no doubt that Google and OpenAI have already done this for internal (or even government) usage.
pick one or more: comically huge model, test time scaling at 10e12W, benchmark overfit