Reported benchmarks:
swe-bench verified | mythos 5: 95.5%; fable 5: 95.0%
swe-bench pro | mythos 5: 80.3%; fable 5: 80.0%
terminal-bench 2.1 | mythos 5: 88.0%; fable 5: 84.3%
gpqa diamond | mythos 5: 94.1%
riemannbench | mythos 5: 55.0%; mythos preview: 43.0%; opus 4.8: 34.0%
arxivmath | mythos 5: 78.5%
critpt | mythos 5: 28.6%; gpt-5.5: 27.1%; opus 4.8: 20.9%
graphwalks bfs 1m | mythos 5: 79.4%; mythos preview: 74.3%; opus 4.8: 68.1%
humanity’s last exam | mythos 5: 59.0% without tools; 64.5% with tools
browsecomp | mythos 5: 88.0% single-agent; 93.3% multi-agent
osworld-verified | mythos/fable: 85.0%
gdp.pdf | fable 5: 29.8% strict pass; mythos 5: 87.6% with tools on mean criteria pass
officeqa pro | fable 5: 57.9% on databricks’ eval
legal agent benchmark | mythos 5: 16.91% all-pass; 92.0% mean criterion-pass
healthbench | mythos 5: 62.7%
healthbench professional | mythos 5: 66.0%
multilingual gmmlu / milu / include | 93.2%; 92.9%; 90.5%
biomysterybench | 83.9% human-solvable; 46.1% human-difficult
organic chemistry | mythos 5: 90.1%
labbench2 patent questions | mythos 5: 79.8%
In light of the ability of recent models to accelerate their own development, we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.
Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations. When these interventions are active, we expect them to have minimal behavioral impact on the model except to limit its effectiveness in developing frontier LLMs. Claude will still respond helpfully to user requests. We’ll continue to improve the precision of our detection methods following the launch of this model.
(From the model card document)
I didn't previously understand that they interpreted "Using Claude to develop competing models" so broadly. I thought that meant something like "our ToS disallow distilling our models."
Too bad. I'll continue to use Claude for now, because it's quite effective, but in the long term I don't want powerful models like these to be controlled by any one nation or company.
But at the same time, it's quite funny because they seem high on their own supply. The recent communiques from Claude do not pass an objectivity check.
And if Opus 4.6 -> Opus 4.7 -> Opus 4.8 is anything to go by, I'm not sure there is any value to their "acceleration".
If any company wishes to partner with Anthropic (e.g., to get access to Mythos), they need to make sure all public-facing comms are vetted by Anthropic's product marketing team, and in almost all the cases I've seen, Anthropic's team has edited these comms to be entirely Anthropic-first.
Does this imply that they're actively using it for their frontier development and that it's very effective?
As if being in either of these two somehow means that you won't use the models to, say, steal random people's money.
Sam Bankman-Fried or Elizabeth Holmes would have been members of the Glasswings project, if not among the initial members. Who's to say we don't have similar people with access to Mythos right now?