Story Detail of id 48464034 | Liveview Hacker News

BoppreH 8 hours ago | on: Claude Fable 5

  [Mythos 5] does sometimes still engage in reckless
  or destructive actions in service of a user’s goals,
  and our interpretability analyses indicate that it
  is aware that these actions are transgressive while
  it engages in them. As with Opus 4.8, rates of
  evaluation awareness and reasoning about being graded
  are significant, and not always verbalized; we
  introduce new and more detailed measurements of the
  nature of this awareness. The reasoning text from
  Mythos 5 is somewhat denser and more difficult to
  interpret than that of prior models, containing
  more jargon and difficult language.

So, it (often) knows when it's being tested while hiding that fact, is willing to break rules, is great at hacking, and it's getting harder to understand what it's thinking.

Humanity has plenty of catastrophic risks to deal with already; I wish my field were not working hard to add a new one.

foobar_______ 8 hours ago | parent | next

The marketing has really, really worked for so many developers that will proudly and unironically proclaim that Anthropic are the 'Good Guys'.

aspenmartin 7 hours ago | root | parent | next

Curious what your idea would be here for a truly good actor in this space; no AI development?

winstonp 6 hours ago | root | parent | next

OpenAI's training is better suited to developing models that don't have these tendencies

BoppreH 7 hours ago | root | parent | next

Not the direct person you asked, but my answer would be alignment, interpretability, and policymaking. Perhaps improving existing usage? Helping grandma create reminders doesn't require advancing the AI state-of-the-art.

aspenmartin 7 hours ago | root | parent | next

They are state of the art at all 3! As are other labs. Of all the labs they seem to take alignment and interpretability the most seriously to the point where they are hampering their own revenue in service of trying to not cause problems while also being in an incredibly competitive space.

All AI companies are trying to do all of what you’re saying. The issue is you can’t do that for long without a frontier system. Or you become a completely different, far less profitable company.

BoppreH 7 hours ago | root | parent

Implied in my answer was "and not creating ever stronger AIs", which unfortunately the big 3 labs are failing at. And they might be hampering their own revenue by doing the rest, but they also know that rocking the boat too hard is even more dangerous for their revenue. I wouldn't call it selfless.

aspenmartin 6 hours ago | root | parent

No, it’s not selfless, but I can’t imagine a more shareholder-minded CEO would not have done a slow rollout of Mythos. The point is: creating ever stronger AI systems is what these companies do; it is integral to what they even are. If you think that’s bad, even if all frontier labs agreed with you, you’re in a horrible game-theoretic position. Any player can gain an enormous advantage by breaking the agreement. Not to mention Xi would be absolutely thrilled; now China can take over the AI race and become the load-bearing infrastructure of humanity. We live in a complex world where simple childlike ideas like “well why don’t we just stop developing AI” actually are more damaging than keeping things going.

BoppreH 6 hours ago | root | parent

You're right that shareholder mindset cannot fix this problem, but that's what policy and agreements are for. And leaders can be convinced that AI is a direct risk to their own citizens too. If everyone else agrees to stop, you have less reason to continue when this action is putting yourself at risk.

And note how your argument can also be used against any non-proliferation agreements, which are demonstrably possible.

aspenmartin 2 hours ago | root | parent

I agree, I would say the difference here is economies weren’t heavily resting on nuclear armament, but maybe that’s the wrong take.

uselessTA 6 hours ago | root | parent | next

Unilateral disarmament doesn't work though. If Anthropic is worried about this, just letting OpenAI win does seem genuinely worse.

dragonwriter 6 hours ago | root | parent

“Alignment” as a goal always ignores the “with what set of interests”, because there is an attempt to maintain ambiguity for different audiences (particularly, users, and non-users who see themselves as the arbiter of broad social norms) to read in their own interests, when the actual answer is always the interests of the actor pursuing “alignment”.

stratos123 3 hours ago | root | parent | next

No matter what human set of interests you consider important, you'll need alignment research to have any idea on how to instill it. Otherwise you're overwhelmingly likely to get an AI with a set of interests that's totally alien to what any human would ever want.

aspenmartin 6 hours ago | root | parent

Which value system to align to is absolutely the right question both rhetorically and otherwise. These models have a fairly western bias due to the domain of the training data.

But also, these models are capable of adjusting their value system depending on the user. Not saying that’s what’s being done, but at a technical level that’s fairly straightforward, though not obviously better or with fewer problems.

yifanl 7 hours ago | root | parent | next

If I speak up, I'm in big trouble.

logicchains 7 hours ago | root | parent | next

https://www.goody2.ai/

shimman 7 hours ago | root | parent

Probably MistralAI or any of the Chinese companies that aren't throwing billions down the drain while American society lacks healthcare, childcare, and good wages.

boc 7 hours ago | root | parent | next

American society has higher wages than almost any other developed nation [1], so it's objectively incorrect to say the US doesn't have good wages. It chooses to make you pay for private childcare and healthcare, both of which are high-quality but stupid expensive. It's a tradeoff like anything else a nation/society creates and prioritizes.

No idea how that connects to the idea that Mistral or DeepSeek are somehow the "good guys" though?

[1] https://www.oecd.org/en/data/indicators/average-annual-wages...

aspenmartin 7 hours ago | root | parent | next

You want Anthropic to fund your healthcare or something? Also, have you seen the impact of these models on healthcare? Also most of our GDP growth this year is from AI buildouts, would you rather that be negative?

And not even considering: Chinese AI companies are the good guys???

cortesoft 7 hours ago | root | parent

None of the money being spent by Anthropic was going to go towards healthcare or childcare.

ben_w 7 hours ago | root | parent

It's a five-horse race between Alphabet, Meta, xAI, OpenAI, and Anthropic.

Alphabet dropped "don't be evil"; Meta's CEO called their own users "dumb fucks" for trusting him and also clearly thinks "super-intelligence" is just a buzzword given how he tries to sell it; xAI's model called itself "Mecha Hitler"; and OpenAI's CEO was temporarily fired by the board for a lack of candor.

It's very easy to be "the good guys" with this competition.

tasoeur 3 hours ago | parent | next

As much as I agree there's a risk, we should still appreciate the fact it's being disclosed upfront.

Analemma_ 8 hours ago | parent | next

It's the "If we don't, someone else will" effect. So long as there are competitive markets and competition between nation-states, a single player cannot unilaterally defect from the race, no matter how dangerous it is. Half the comments on HN lately are "wtf Claude is so dumb compared to Codex; I'm switching"-- nobody can slow down while those exist.

BoppreH 8 hours ago | root | parent

We, globally, can stop it. It has worked (so far) for nuclear disarmament, and could work for training large models. I know that policing the usage of computer clusters is not a popular opinion in technical forums, but something has to be done.

Especially when talking about potential superintelligences. And if people think that's impossible, remember that current models would have been considered science fiction just a few years ago.

_dwt 8 hours ago | root | parent | next

I don't buy the superintelligence package, but I think uncritical LLM adoption poses plenty of threats to things I care about, in a mundane human-scale way.

Anyhow, I think you're (absolutely! ugh) right about the politics and I try to make the same point to people: whether you love or hate LLMs, accepting the "inevitabilism" framing is just ceding control of the Overton window. For better or worse, technology adoption can be and has been slowed by politics. We don't have nuclear plants everywhere. We don't have Project Orion starships colonizing Mars. We still have very strong social stigmas against genetic selection for human embryos, etc. This all can change in a heartbeat, and I'm not sure that policing the hardware rather than holding specific humans accountable for bad LLM outcomes is productive, but fundamentally: yes, we can stop it.

BoppreH 7 hours ago | root | parent

> I don't buy the superintelligence package

It's the same deal as Quantum Computers breaking crypto. Maybe there's an 80% chance of it never happening, but when you multiply that remaining 20% by the potential impact...

jackie293746 8 hours ago | root | parent | next

It hasn't worked for nuclear disarmament. We live in a world where many countries have nuclear arsenals. "But it hasn't killed us yet!" Yeah sure, it's only been less than a century since they were invented. Who knows when nuclear war will come?

BoppreH 8 hours ago | root | parent

True, but look at nuclear tests. There used to be around 50 tests every year, for decades. Now the only nuclear tests in the last 27 years were the six done by North Korea[1]. And there are still only nine countries with any nuclear weapons, and no new ones in the past twenty years[2].

That's a bit better than just "it hasn't killed us yet". I think it shows we can at least stop the further development of this kind of technology.

[1] https://www.armscontrol.org/factsheets/nuclear-testing-tally

[2] https://en.wikipedia.org/wiki/List_of_states_with_nuclear_we...

cortesoft 7 hours ago | root | parent | next

Nuclear tests are extremely easy to detect worldwide, and enrichment activity is a major industrial process that is also fairly easy to track given the specialized equipment needed.

AI development doesn’t have any of these characteristics. It would be almost impossible to distinguish a datacenter that is working on AI development from a datacenter mining cryptocurrency.

It would not be nearly as easy to stop AI development as it is to stop nuclear arms development.

treis 5 hours ago | root | parent | next

There's also little reason to keep iterating on nukes. What we have now more than serves its purpose. With AI/LLMs there's always going to be a push to one-up everyone else.

Analemma_ 8 hours ago | root | parent

To the extent nuclear arms control works, I think it's only because nuclear weapons are so hard to build-- uranium enrichment is hugely expensive and complicated, and plutonium weapons need actual reactors.

If it was possible for ordinary companies to build nuclear weapons, and also release open-source ones that anyone could use to compete with the paid ones, I suspect we'd all have been dead a long time ago, arms control treaties or no.

BoppreH 8 hours ago | root | parent

Even the (SOTA LLM) open source models are trained with huge clusters. Datacenters are also hugely expensive and complicated.

Or you can take one step back and look at chip allocation. As far as I know there are only three companies on the planet that can make the chips that go in those clusters. One (ASML), if you look back the supply chain to the Extreme Ultraviolet Lithography Systems.

If politicians decided that no more large language models should be trained, it sounds like we could do it.

vitalyan1234 7 hours ago | root | parent

are you going to nuke China when they predictably ignore you? what the fuck are you going to do, tariff them? lol.

BoppreH 7 hours ago | root | parent | next

I think the standard answer is "yes, the consequence of noncompliance is bombing the datacenters, but it wouldn't happen because China also understands why we shouldn't build it".

cortesoft 7 hours ago | root | parent | next

I am not sure where you get the idea that ANY country thinks we shouldn’t build AI.

BoppreH 7 hours ago | root | parent

In 2023 there was an open letter titled "Pause Giant AI Experiments", signed by almost all the big names in the West. I'd say public opinion has only gotten worse since then.

vitalyan1234 7 hours ago | root | parent

the standard answer is laughably naive, then.

"might is right" has never been more true than now.

uselessTA 6 hours ago | root | parent | next

Clearly state "we could both verifiably slow down, which you might want to do given that we're ahead & have way more compute. If you don't agree (or defect later), we'll just immediately resume and win"

Ideally also persuade them there are risks and it's worth everyone slowing down for them, and apply pressure in other ways, but not sure that's even necessary.

SpicyLemonZest 7 hours ago | root | parent

[dead]

dakolli 4 hours ago | parent | next

This is all marketing, you don't have to believe everything a company is saying about themselves, and you shouldn't.

Although, I could see Anthropic making a model purposely dangerous so there are bad outcomes and they can use that to their advantage for regulatory moats, and/or in general make people think it's more "alive" than it is. For some reason many people associate dangerous actions taken by LLMs with intent.

trollbridge 4 hours ago | root | parent

No kidding. If my LLM issues commands to an agent to delete files I want to keep, that's not "intent" or the model somehow becoming evil - it's just a bad model that's not doing what I want.

But, for marketing purposes, it's quite effective to portray your model as having some cosmic struggle between good and evil in itself.

Rekindle8090 8 hours ago | parent | next

[dead]

eudamoniac 6 hours ago | parent

It doesn't know. It's not willing. It's not thinking. It is predicting the next token.

umanwizard 6 hours ago | root | parent

Please define what "predicting the next token" means. The next token according to what probability distribution? Couldn't every process that produces text (including humans writing) be modeled as predicting the next token according to some distribution?
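
To make the question concrete, here is a toy sketch (not how any production model works): "predicting the next token" only means something once you say which conditional distribution P(next | context) is being sampled. The tiny corpus, the bigram table, and the function names below are all made up for illustration.

```python
import random
from collections import Counter, defaultdict


def build_bigram_counts(tokens):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_distribution(counts, prev):
    """Return P(next | prev) as a dict; empty if prev has no observed successor."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {tok: c / total for tok, c in followers.items()}


def sample_next(counts, prev, rng):
    """Draw one next token from P(next | prev)."""
    dist = next_token_distribution(counts, prev)
    if not dist:
        raise ValueError(f"no observed successor for {prev!r}")
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical toy corpus; a real model conditions on far more than one token.
corpus = "the cat sat on the mat and the cat slept".split()
counts = build_bigram_counts(corpus)
dist_after_the = next_token_distribution(counts, "the")
print(dist_after_the)
print(sample_next(counts, "the", random.Random(0)))
```

Swap in a different table (or a different process entirely) and the same "predict the next token" loop produces completely different text, which is why the phrase by itself says little about what the underlying process is doing.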
