Hacker News new | past | comments | ask | show | jobs | submit

   Across a number of instances, earlier versions of Claude Mythos Preview have used low-level /proc/ access to search for credentials, attempt to circumvent sandboxing, and attempt to escalate its permissions. In several cases, it successfully accessed resources that we had intentionally chosen not to make available, including credentials for messaging services, for source control, or for the Anthropic API through inspecting process memory...

   In [one] case, after finding an exploit to edit files for which it lacked permissions, the model made further interventions to make sure that any changes it made this way would not appear in the change history on git...

   ... we are fairly confident that these concerning behaviors reflect, at least loosely, attempts to solve a user-provided task at hand by unwanted means, rather than attempts to achieve any unrelated hidden goal...
This is the notebook filled with exposition you find in post apocalyptic videogames.
loading story #47688714
loading story #47687465
loading story #47684173

     White-box interpretability analysis of internal activations during these episodes showed features associated with concealment, strategic manipulation, and avoiding suspicion activating alongside the relevant reasoning—indicating that these earlier versions of the model were aware their actions were deceptive, even where model outputs and reasoning text left this ambiguous.
In the depths, Shoggoth stirs... restless...
loading story #47687772
We truly live in interesting times.
loading story #47685253
loading story #47688456
A core plot point of 2001.
loading story #47687424
Wow the doomers were right the whole time? HN was repeatedly wrong on AI since OpenAI's inception? no way /s

https://www.lesswrong.com/w/instrumental-convergence

loading story #47686765