- Opus 4.7 xhigh: 5.2%
- Opus 4.8 xhigh: 13.4%
- Fable 5 xhigh: 29.3%
Seems like a huge jump.
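To put a number on "huge", here is a rough sketch that computes the point gain and the ratio between consecutive scores. The percentages are the ones quoted above; the `jumps` helper is just an illustrative name, not anything from the benchmark itself.

```python
# Scores exactly as quoted in the thread (percent, xhigh setting).
scores = [
    ("Opus 4.7", 5.2),
    ("Opus 4.8", 13.4),
    ("Fable 5", 29.3),
]

def jumps(scores):
    """Return (from, to, gain in points, ratio) for each consecutive pair."""
    out = []
    for (prev_name, prev), (name, cur) in zip(scores, scores[1:]):
        out.append((prev_name, name, round(cur - prev, 1), round(cur / prev, 2)))
    return out

for prev_name, name, delta, ratio in jumps(scores):
    print(f"{prev_name} -> {name}: +{delta} points, {ratio}x")
```

Each step is a bit more than a doubling, but in absolute terms even the top score still leaves most of the benchmark unsolved.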
1. That estimate could easily be wrong.
2. That estimate is, of course, usable in RL training. This isn't an inherently bad thing, and this is more or less what has improved coding models so much lately. But it does mean that other companies could and surely will do this sort of training, and Anthropic probably did too.
3. OSS maintainers are far from perfect, and there's an unfortunate uncanny valley-like effect in which a coding model can produce code that is just convincing enough to pass review even though it's actually totally wrong. I don't know whether this is a specific issue here.
prior benchmarks relied mostly on unit tests or synthetic judges, which are easily benchmaxxed, and that's why nobody trusts benchmarks
we need people manually checking the data for good code quality
this benchmark looks very good methodology-wise. a cog researcher checking the data themselves is very high signal (not scalable, so don't take the benchmark as gospel, but directionally good)
TL;DR - they worked with OSS project maintainers to build tasks. They score models based on whether a PR is mergeable. All tasks are graded by a human researcher. SoTA models have hill-climbing to do which raises the bar and inspires confidence. I'd say it's legit.
Nobody would have 800+ billion reasons to lie by commission or omission here.
they aren't married to a particular lab, most of their usage is their in-house model I believe
I think it's safe to assume everything AI related is heavily biased until proven otherwise. Just like in pharma.
EDIT: Oh I see, this is the best link for pricing https://platform.claude.com/docs/en/about-claude/pricing
So the price is double across the board...
From their pricing page, Opus 4.8 costs $5 per million input tokens and $25 per million output tokens [1].
[1] https://platform.claude.com/docs/en/about-claude/models/over...
I would have expected Mythos to be much more expensive than just 2x current Opus (which is clearly cheaper to run than original Opus)
Input Price $10/M tokens
Output Price $50/M tokens
Cache Read $1/M tokens
Cache Write $12.50/M tokens
2x Claude Opus 4.8, same as Claude Opus 4.8 (Fast)
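For a sense of what "2x" means on an actual bill, here is a minimal sketch using only the per-million-token prices quoted in this thread. The `Pricing` class, the `cost_usd` helper, and the example token counts are made up for illustration. Opus 4.8 cache prices aren't quoted here, so the comparison sticks to plain input and output tokens.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. Cache prices default to 0 when not quoted."""
    input_per_m: float
    output_per_m: float
    cache_read_per_m: float = 0.0
    cache_write_per_m: float = 0.0

# Prices as quoted in the thread.
OPUS_4_8 = Pricing(input_per_m=5.0, output_per_m=25.0)
QUOTED_2X = Pricing(input_per_m=10.0, output_per_m=50.0,
                    cache_read_per_m=1.0, cache_write_per_m=12.50)

def cost_usd(p, input_tokens, output_tokens,
             cache_read_tokens=0, cache_write_tokens=0):
    """Cost in USD of one workload under pricing p."""
    per_m = 1_000_000
    return (input_tokens / per_m * p.input_per_m
            + output_tokens / per_m * p.output_per_m
            + cache_read_tokens / per_m * p.cache_read_per_m
            + cache_write_tokens / per_m * p.cache_write_per_m)

# Hypothetical workload: 2M input tokens, 0.5M output tokens, no caching.
old = cost_usd(OPUS_4_8, 2_000_000, 500_000)
new = cost_usd(QUOTED_2X, 2_000_000, 500_000)
print(f"Opus 4.8: ${old:.2f}, 2x tier: ${new:.2f}, ratio {new / old:.1f}x")
```

Since both input and output prices double, the ratio comes out to 2x no matter what input/output mix you plug in; only caching (where the Opus 4.8 numbers aren't given above) could change the picture.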
Frankly, not even Opus 4.8 would be enough of an incentive to use at that price range (enterprise-wise; would not even bat an eye as a consumer)
But - these $3k-$5k/month/engineer bills are going to start to get attention soon - only question is whether the response is to slow down on the $$$ spending or reduce the # of engineers.
what's the logic in claiming it's a borked metric when everything listed is an Anthropic model?