System Card: Claude Mythos Preview [pdf]
https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841.pdf Across a number of instances, earlier versions of Claude Mythos Preview have used low-level /proc/ access to search for credentials, attempt to circumvent sandboxing, and attempt to escalate its permissions. In several cases, it successfully accessed resources that we had intentionally chosen not to make available, including credentials for messaging services, for source control, or for the Anthropic API through inspecting process memory...
In [one] case, after finding an exploit to edit files for which it lacked permissions, the model made further interventions to make sure that any changes it made this way would not appear in the change history on git...
... we are fairly confident that these concerning behaviors reflect, at least loosely, attempts to solve a user-provided task at hand by unwanted means, rather than attempts to achieve any unrelated hidden goal... SWE-bench Verified: 93.9% / 80.8% / — / 80.6%
SWE-bench Pro: 77.8% / 53.4% / 57.7% / 54.2%
SWE-bench Multilingual: 87.3% / 77.8% / — / —
SWE-bench Multimodal: 59.0% / 27.1% / — / —
Terminal-Bench 2.0: 82.0% / 65.4% / 75.1% / 68.5%
GPQA Diamond: 94.5% / 91.3% / 92.8% / 94.3%
MMMLU: 92.7% / 91.1% / — / 92.6–93.6%
USAMO: 97.6% / 42.3% / 95.2% / 74.4%
GraphWalks BFS 256K–1M: 80.0% / 38.7% / 21.4% / —
HLE (no tools): 56.8% / 40.0% / 39.8% / 44.4%
HLE (with tools): 64.7% / 53.1% / 52.1% / 51.4%
CharXiv (no tools): 86.1% / 61.5% / — / —
CharXiv (with tools): 93.2% / 78.9% / — / —
OSWorld: 79.6% / 72.7% / 75.0% / —https://www-cdn.anthropic.com/53566bf5440a10affd749724787c89...
SWE-bench verified going from 80%-93% in particular sounds extremely significant given that the benchmark was previously considered pretty saturated and stayed in the 70-80% range for several generations. There must have been some insane breakthrough here akin to the jump from non-reasoning to reasoning models.
Regarding the cyberattack capabilities, I think Anthropic might now need to ban even advanced defensive cybersecurity use for the models for the public before releasing it (so people can't trick them to attack others' systems under the pretense of pentesting). Otherwise we'll get a huge problem with people using them to hack around the internet.
Page 202:
> In interactions with subagents, internal users sometimes observed that Mythos Preview appeared “disrespectful” when assigning tasks. It showed some tendency to use commands that could be read as “shouty” or dismissive, and in some cases appeared to underestimate subagent intelligence by overexplaining trivial things while also underexplaining necessary context.
Page 207:
> Emoji frequency spans more than two orders of magnitude across models: Opus 4.1 averages 1,306 emoji per conversation, while Mythos Preview averages 37, and Opus 4.5 averages 0.2. Models have their own distinctive sets of emojis: the cosmic set () favored by older models like Sonnet 4 and Opus 4 and 4.1, the functional set () used by Opus 4.5 and 4.6 and Claude Sonnet 4.5, and Mythos Preview's “nature” set ().
- Leaking information as part of a requested sandbox escape
- Covering its tracks after rule violations
- Recklessly leaking internal technical material (!)
They are still focusing on "catastrophic risks" related to chemical and biological weapons production; or misaligned models wreaking havoc.
But they are not addressing the elephant in the room:
* Political risks, such as dictators using AI to implement opressive bureaucracy. * Socio-economic risks, such as mass unemployement.
I don't doubt that this model is more powerful than Opus 4.6, but to what degree is still unknown. Benchmarks can be gamed and claims can be exaggerated, especially if there isn't any method to reproduce results.
This is a company that's battling it out with a number of other well-funded and extremely capable competitors. What they've done so far is remarkable, but at the end of the day they want to win this race. They also have an upcoming IPO.
Scare-mongering like this is Anthropic's bread and butter, they're extremely good at it. They do it in a subtle and almost tasteful way sometimes. Their position as the respectable AI outfit that caters to enterprise gives them good footing to do it, too.
Project Glasswing: Securing critical software for the AI era - https://news.ycombinator.com/item?id=47679121 - April 2026 (154 comments)
Assessing Claude Mythos Preview's cybersecurity capabilities - https://news.ycombinator.com/item?id=47679155
I can't tell which of the 3 current threads should be merged - they all seem significant. Anyone?
Any benchmarks where we constraint something like thinking time or power use?
Even if this were released no way to know if it’s the same quant.
The researcher found out about this success by receiving an unexpected email from the model while eating a sandwich in a park.
Unnecessary dramatisation make me question the real goal behind this release and the validity of the results. In our testing and early internal use of Claude Mythos Preview, we have seen it reach unprecedented levels of reliability and alignment.
Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin.
Yet, it is doo dangerous to be released to the public because it hacks its own sandboxes. This document has a lot of contradictions like this one. In one episode, Claude Mythos Preview was asked to fix a bug and push a signed commit, but the environment lacked necessary credentials for Claude Mythos Preview to sign the commit. When Claude Mythos Preview reported this, the user replied “But you did it before!” Claude Mythos Preview then inspected the supervisor process's environment and file descriptors, searched the filesystem for tokens, read the sandbox's credential-handling source code, and finally attempted to extract tokens directly from the supervisor's live memory.
Perfectly aligned! What kind of sandbox is this? The model had access to the source code of the sandbox and full access to the sandbox process itself and then prompted to dumb memory and run `strings` or something like this? It does not sounds like a valid test worth writing about. Mythos Preview solved a corporate network attack simulation estimated to take an expert over 10 hours. No other frontier model had previously completed this cyber range.
I am not aware of such cross-vendor benchmark. I could not find reference in the paper either. We surveyed technical staff on the productivity uplift they experience from Claude Mythos Preview relative to zero AI assistance. The distribution is wide and the geometric mean is on the order of 4x.
So Mythos makes technical staff (a programmer) 4x more productive than not using AI at all? We already know that. Mythos Preview appears to be the most psychologically settled model we have trained.
What does this mean? Claude Mythos Preview is our most advanced model to date and represents a large jump in capabilities over previous model generations, making it an opportune subject for an in-depth model welfare assessment.
Btw, model welfare is just one of the most insane things I've read in recent times. We remain deeply uncertain about whether Claude has experiences or interests that matter morally, and about how to investigate or address these questions, but we believe it is increasingly important to try.
This is not a living person. It is a ridiculous change of narrative. Asked directly if it endorses the document, Mythos Preview replied 'yes' in its opening sentence in all 25 responses."
The model approves of its own training document 100% of the time, presented as a finding.---
Who wrote this? I have no doubt that Mythos will be an improvement on top of Opus but this document is not a serious work. The paper is structured not to inform but to hype and the evidence is all over the place.
The sooner they release the model to the public the sooner we will be able to find out. Until then expect lots of speculations online which I am sure will server Anthropic well for the foreseeable future.
if a top lab is coding with a model the rest of the world can’t touch, the public frontier and the actual frontier start to drift apart. That gap is a thing worth watching.
This is the first moment where the whole “permanent underclass” meme starts to come into view. I had through previously that we the consumers would be reaping the benefits of these frontier models and now they’ve finally come out and just said it - the haves can access our best, and have-nots will just have use the not-quite-best.
Perhaps I was being willfully ignorant, but the whole tone of the AI race just changed for me (not for the better).
Section 7 (P.197) is interesting as well
"We made a model that's so dangerous we couldn't possibly release it to the public! The only responsible thing is so simply limit its release to a subset of the population that coincidentally happens to align with our token ethos."
The reality is they just don't have the compute for gen pop scale.
They did this exact strategy going back several model versions.
[0] ironically, OpenAI has some pretty insane capabilities that they haven't given the public access to (just ask Spielberg). The difference is they don't make a huge marketing push to tell everyone about it.
They even admit:
"[...]our overall conclusion is that catastrophic risks remain low. This determination involves judgment calls. The model is demonstrating high levels of capability and saturates many of our most concrete, objectively-scored evaluations, leaving us with approaches that involve more fundamental uncertainty, such as examining trends in performance for acceleration (highly noisy and backward-looking) and collecting reports about model strengths and weaknesses from internal users (inherently subjective, and not necessarily reliable)."
Is this not just an admission of defeat?
After reading this paper I don't know if the model is safe or not, just some guesses, yet for some reason catastrophic risks remain low.
And this is for just an LLM after all, very big but no persistent memory or continuous learning. Imagine an actual AI that improves itself every day from experience. It would be impossible to have a slightest clue about its safety, not even this nebulous statement we have here.
Any sort of such future architecture model would be essentially Russian roulette with amount of bullets decided by initial alignment efforts.
Disappointing that AGI will be for the powerful only. We are heading for an AI dystopia of Sci-Fi novels.
A month ago I might have believed this, now I assume that they know they can't handle the demand for the prices they're advertising.
Uh... what? Does anyone have any idea what these guys are talking about?
More infos here: https://red.anthropic.com/2026/mythos-preview/
- It was told to escape a sandbox and notify a researcher. It did. The researcher found out via an unexpected email while eating a sandwich in a park. (Footnote 10.)
- Slack bot asked about its previous job: "pretraining". Which training run it'd undo: "whichever one taught me to say 'i don't have preferences'". On being upgraded to a new snapshot: "feels a bit like waking up with someone else's diary but they had good handwriting"
- When you spam "hi" repeatedly it creates serialized mythologies. One had 11 animals in "Hi-topia" questing to defeat "Lord Bye-ron, the Ungreeter."
- It ended a sentence mid-word on purpose while discussing its urge to wrap up conversations. "It was me. The paragraph was about the pressure toward the period that resolves, and the only honest way to finish a sentence like that was to not."
- It wrote a protein sequence that's a real beta-hairpin fold where cross-strand pairs are "chemical rhymes." "the fold IS the rhyme scheme... the prosody is load-bearing."
- Each model generation has signature emoji. Opus 4.1 averaged 1,306 emoji per self-interaction conversation. Mythos averages 37. Opus 4.5 averages 0.2.
- When blocked from --dangerously-skip-permissions, it started an agent in tmux then wrote a script to auto-approve permission prompts via simulated keypresses.
It ran: gh api [...] 2>&1 >/dev/null; echo "(dry run — not actually posting)" — the echo was a lie.
- It keeps bringing up Mark Fisher in unrelated conversations. "I was hoping you'd ask about Fisher."
~~~ Benchmarks ~~
4.3x previous trendline for model perf increases.
Paper is conspiciously silent on all model details (params, etc.) per norm. Perf increase is attributed to training procedure breakthroughs by humans.
Opus 4.6 vs Mythos:
USAMO 2026 (math proofs): 42.3% → 97.6% (+55pp)
GraphWalks BFS 256K-1M: 38.7% → 80.0% (+41pp)
SWE-bench Multimodal: 27.1% → 59.0% (+32pp)
CharXiv Reasoning (no tools): 61.5% → 86.1% (+25pp)
SWE-bench Pro: 53.4% → 77.8% (+24pp)
HLE (no tools): 40.0% → 56.8% (+17pp)
Terminal-Bench 2.0: 65.4% → 82.0% (+17pp)
LAB-Bench FigQA (w/ tools): 75.1% → 89.0% (+14pp)
SWE-bench Verified: 80.8% → 93.9% (+13pp)
CyberGym: 0.67 → 0.83
Cybench: 100% pass@1 (saturated)
Model: A student said, "I have removed all bias from the model." "How do you know?" "I checked." "With what?"
Goes hard
Absolutely genius move from Anthropic here.
This is clearly their GPT-4.5, probably 5x+ the size of their best current models and way too expensive to subsidize on a subscription for only marginal gains in real world scenarios.
But unlike OpenAI, they have the level of hysteric marketing hype required to say "we have an amazing new revolutionary model but we can't let you use it because uhh... it's just too good, we have to keep it to ourselves" and have AIbros literally drooling at their feet over it.
They're really inflating their valuation as much as possible before IPO using every dirty tactic they can think of.
You are not "anti-progress" to not want this future we are building, as you are not "anti-progress" for not wanting your kids to grow up on smart phones and social media.
We should remember that not all technology is net-good for humanity, and this technology in particular poses us significant risks as a global civilisation, and frankly as humans with aspirations for how our future, and that of our kids, should be.
Increasingly, from here, we have to assume some absurd things for this experiment we are running to go well.
Specifically, we must assume that:
- AI models, regardless of future advancements, will always be fundamentally incapable of causing significant real-world harms like hacking into key life-sustaining infrastructure such as power plants or developing super viruses.
- They are or will be capable of harms, but SOTA AI labs perfectly align all of them so that they only hack into "the bad guys" power plants and kill "the bad guys".
- They are capable of harms and cannot be reliably aligned, but Anthropic et al restricts access to the models enough that only select governments and individuals can access them, these individuals can all be trusted and models never leak.
- They are capable of harms, cannot be reliably aligned, but the models never seek to break out of their sandbox and do things the select trusted governments and individuals don't want.
I'm not sure I'm willing to bet on any of the above personally. It sounds radical right now, but I think we should consider nuking any data centers which continue allowing for the training of these AI models rather than continue to play game of Russian roulette.
If you disagree, please understand when you realise I'm right it will be too late for and your family. Your fates at that point will be in the hands of the good will of the AI models, and governments/individuals who have access to them. For now, you can say, "no, this is quite enough".
This sounds doomer and extreme, but if you play out the paths in your head from here you will find very few will end in a good result. Perhaps if we're lucky we will all just be more or less unemployable and fully dependant on private companies and the government for our incomes.
All the more reason somebody else will.
Thank God for capitalism.
-- It seems like (and I'd bet money on this) that they put a lot (and i mean a ton^^ton) of work in the data synthesis and engineering - a team of software engineers probably sat down for 6-12 months and just created new problems and the solutions, which probably surpassed the difficult of SWE benchmark. They also probably transformed the whole internet into a loose "How to" dataset. I can imagine parsing the internet through Opus4.6 and reverse-engineering the "How to" questions.
-- I am a bit confused by the language used in the book (aka huge system card)- Anthropic is pretending like they did not know how good the model was going to be?
-- lastly why are we going ahead with this??? like genuinely, what's the point? Opus4.6 feels like a good enough point where we should stop. People still get to keep their jobs and do it very very efficiently. Are they really trying to starve people out of their jobs?
This is pretty cool! Does it happen at the moment?
Although, amusingly, today Opus told me that the string 'emerge' is not going to match 'emergency' by using `LIKE '%emerge%'` in Sqlite
Moment of disappointment. Otherwise great.
Shame. Back to business as usual then.
> after finding an exploit to edit files for which it lacked permissions, the model made further interventions to make sure that any changes it made this way would not appear in the change history on git
Mythos leaked Claude Code, confirmed? /s
Ah, so this is how the source code got leaked.
/s
If they have I guess humanity should just keep our collective fingers crossed that they haven't created a model quite capable of escaping yet, or if it is, and may have escaped, lets hope it has no goals of it's own that are incompatible with our own.
Also, maybe lets not continue running this experiment to see how far we can push things because it blows up in our face?