Claude Opus 4.8

https://www.anthropic.com/news/claude-opus-4-8

1654craigmart | 22 hours ago | 1289 | HN

A rambling comment:

I think this is the first time we've had a third minor version bump on a frontier Anthropic model. (I count the 0.5s as major here, because they've been issued non-sequentially and also corresponded to massive capability leaps, eg, Sonnet 3.5, Opus 4.5).

So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.

Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress. Maybe the incrementalism is such that I'd notice immediately if my 4.7 workflows were redirected now to 4.5.

Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it and that I can use it.

But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.

onlyrealcuzzo21 hours ago | parent | next

I won't be surprised if the next gen frontier models are the last.

There's orders of magnitude of low hanging juice to squeeze out of smaller models.

It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years (design not certain, probably unlikely).

It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.

As far as reasoning is concerned, with the recent GRAM release, there may be 4 orders of magnitude of reasoning to tack on to smaller models.

Think about that... Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T params... They could upgrade that to a ~600B MoE model in days to have general trivia knowledge rivaling the best models...

You just can't train a 1T+ parameter model that fast. It is a giant if how much GRAM turns out to improve things, but it's unlikely to be trivial or nothing.

Larger models can already sort of tell you anything. They're never going to get everything right unless they stop being LLMs.

There's just not a lot of juice left to squeeze for Gemini to tell you exactly how tall Ke$ha is or when the last time Brittney Spears went to jail was...

vlovich12321 hours ago | root | parent | next

Took me a while to find what you were referring to by gram. Arxiv paper from 9 days ago that's not properly indexed by search engines.

(G)enerative (R)ecursive re(A)soning (M)odels. They really wanted the acronym.

https://arxiv.org/html/2605.19376v1

loading story #48312996

loading story #48313152

loading story #48313012

loading story #48314090

loading story #48319027

loading story #48317201

loading story #48316952

loading story #48313006

loading story #48315231

loading story #48320346

mrandish18 hours ago | root | parent | next

> Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T param

I agree but with their urgent IPO-driven need to keep increasing prices, the frontier vendors now have every incentive maintain the perception that frontier performance requires endless >$200K racks of unobtanium GPUs and RAM. While they'd love to reduce their actual costs, they'd only want to do it to the extent they are certain they can keep it secret. Otherwise, they can't maintain and keep increasing their prices. And post-IPO audited reporting makes keeping that secret even harder.

Game theory-wise they probably don't want their their armies of leading researchers optimizing frontier performance, at least in any way that would further accelerate the relative price/perf of smaller models or self/cloud-hosting. While they know the open source models will always improve, the still win as long as enough customers demand the latest frontier and the open source lag remains constant.

They profit most in a world where a few frontier labs stay far in front, drag-racing each other and expending vast capital. It keeps their customers reliant and paying top dollar while keeping low-cost alternatives farther back. They probably much prefer competing with a couple other frontier labs who have similar astronomical costs and biz models, than a world where self or cloud-hosted open-source models start closing the gap enough to start commoditizing their business.

loading story #48316129

loading story #48315046

supern0va21 hours ago | root | parent | next

>It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.

I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.

If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.

I'm curious if someone here with a stronger background in the space has a similar intuition or not.

loading story #48316463

loading story #48314176

loading story #48312601

loading story #48312485

loading story #48320400

sometimelurker20 hours ago | root | parent | next

I looked into this "GRAM" stuff a sibling comment links further to, and just to say:

- this gets reinvented/rediscovered constantly under different names

- it cant be trained very well (right now, will change)

- massive theoretical improvements over current models (log_2(vocabsize)=17, residual stream dim is thousands of dimensions, recursivity means more information bandwidth by ~3 OoM)

- BUT it cant be interpreted or aligned <- this is why no one uses it and no one talks about it. the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used

I follow this stuff closely, I think I know what I'm talking about (edited for formating)

loading story #48314810

loading story #48313351

nbardy16 hours ago | root | parent | next

There is endless returns to frontier intelligence, just because most people can't make use of it doesn't mean someone can't make a ton of money off of it.

Most software engineers will just need cheap tokens.

But things like physics and drug discovery have no foreseeable upper bound.

loading story #48317913

loading story #48317974

notrealyme1237 hours ago | root | parent | next

The GRAM model is so much into my research direction, I love it. Thank you for posting it.

Where do I find papers like this? Outside of hacker news comments. It's so hard to find the good stuff in all the noise IMO.

ACCount3716 hours ago | root | parent | next

GRAM is another one of those "stupid specific architectures" - same as HRMs, etc. It can sort of contest LLMs at specific puzzles. It demonstrated that much. It's not a general contender with LLMs at LLM tasks.

If you subscribe to things like "there are tasks LLMs are innately bad at due to insufficient depth and lack of recurrent capability", then GRAM might be another signal towards that.

But keep in mind: even ARC-AGIs have their frontiers dominated by LLMs. Even if "innately bad" is true, it clearly doesn't go all the way to "innately incapable".

loading story #48316614

jruz21 hours ago | root | parent | next

Absolutely that’s why they’re rushing to IPO now to squeeze the last drop of the bubble they know this is a dead end.

loading story #48314356

loading story #48312623

loading story #48312627

redox9917 hours ago | root | parent | next

Small models don't have enough parameters to memorize the entire internet. For very common prompts you don't notice that, but when you rely on some niche knowledge that might only appear once in the entire web, a single blogpost, a single github issue, a single pdf, you need to be lucky enough that the agent runs a web search AND it returns what you need.

Even as humans there's so much knowledge out there that exists but it's very hard to surface unless you know exactly what you're looking for beforehand.

slashdave21 hours ago | root | parent | next

I think you are assuming training from scratch, which I doubt is happening here. Fine-tuning and RL, especially based on synthetic feedback (coding skill, in particular) can be ongoing and is where these models obtain truly useful abilities.

qurren17 hours ago | root | parent | next

> It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks

The benchmarks need to change. The current coding benchmarks don't capture the realities of software engineering.

I had a bunch of images that got masked by some logic, I had to evaluate something on the original images, Claude 4.7 decided to inpaint the masked images instead of just fetching the actual unmasked images from upstream.

I had another model once that decided that because it couldn't figure out how to fill out a form to log into HuggingFace to download weights for some open source model that it was going to instantiate the model with random weights and run inference on a thousand images.

Its coding was fine, but the solution was not the right one.

hellohello221 hours ago | root | parent | next

"It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years"

What insight do you have to make this claim?

loading story #48313117

loading story #48313059

loading story #48313610

UncleOxidant15 hours ago | root | parent | next

> It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years

Given how well Qwen3.6-27B performs for such a small model I think you could be right. I suspect that Google,OpenAI,Anthropic must be looking at the Qwen3.6 models (as well as Deepseek V4-flash, MiMo-V2.5) and wondering if they could make some smaller models that are specifically trained for certain activities - like coding. Smaller, more targeted models would take up a lot less resources.

loading story #48317704

mucle621 hours ago | root | parent | next

> I won't be surprised if the next gen frontier models are the last.

the last?!? I'm excited to see :) I'll take the other side of that since llms are so new

loading story #48312878

nbardy16 hours ago | root | parent | next

There is endless returns to frontier intelligence, just because most people can't make use of it doesn't mean someone can't make a ton of money off of it.

Most software engineers will just need cheap tokens.

But things like physics and drug discovery have no forseeable upper bound.

loading story #48316380

loading story #48316587

mickdarling19 hours ago | root | parent | next

I effectively distill the frontier models by building whole sets of skills, personas, and other artifacts that I can then run on smaller models and get 10% even 20% improvements on models like haiku or local models.

There's a lot of room for improving the smaller models at many levels of the stack.

loading story #48317751

merlindru21 hours ago | root | parent | next

surely training also gets cheaper so justifying it becomes easier?

i think it'll be more like we get 1-10T models and then distill those down into smaller models, though

It seems like the best small models today are all distilled from bigger models

Moreover, I hypothesize Claude Opus 4.7 and now 4.8 are a distillation of Claude Mythos

dbbk19 hours ago | root | parent | next

I'm frankly surprised the focus is still on these enormous "know everything in the world" models. I would think you could create an incredibly lean and smart "just React and React Native" model.

loading story #48315996

loading story #48314629

loading story #48316221

yomismoaqui21 hours ago | root | parent | next

Let's hope that hitting a scaling wall and less money to spend will begin redirecting efforts to optimize inference and get the same results with less compute.

Boomer comparison, but I remember the 8 bit computer era when the hardware was what it was so the later games of that era used hardware better than previous ones.

ishurand419 hours ago | root | parent | next

And anyway, with quantum, there will be no need for frontier companies as you might be able to even run a 1T param model on a consumer quantum computer.

loading story #48314601

loading story #48314709

loading story #48319301

firebirdn9921 hours ago | root | parent | next

you just need to look at Mythos to see the jump in performance from a 10T(?) model. As they scale, they get more capable. We might have an yearly release, but I believe the releases will continue, as long as scaling laws are in tact, and there's huge problems still need solving. (think cancer)

loading story #48312850

loading story #48314089

loading story #48313309

Forgeties7921 hours ago | root | parent | next

> I won't be surprised if the next gen frontier models are the last.

I’d be surprised tbh. Investors don’t want to hear “everyone else is still training models and seeing improvements, but we don’t want to participate in the arms race anymore.” They want monumental leaps every quarter or two because they have sunk unholy amounts of money into these companies/products.

The whole idea of “hyper scale” doesn’t jive with caution and or otherwise slowing down.

loading story #48313495

Gomotono20 hours ago | root | parent | next

I don't think this is true at all. It might feel like this because we are used to a very very fast release cycle but we are only in this topic for a few years.

We have so many ways of optimizing:

- continusly creating more and better training data

- increasing parameters to 20/50/100TB

- We still wait for Mythos access

- We still wait for Mythos distilation (i haven't heard any rumors or so that there is a distilled version of Mythos out)

- Reinforcment learning and evolutionary algortihm only started to appear

- If a small 30GB Model can do stuff, these models can also be used as teachers for the big ones

- We have not seen yet specialized models at all. Like a coding java german expert model. Why? Even with MoE architecture, you still need to have these layers around

- Research for Diffusion and other models is still in progress

- Nvidia just announced/showed a 7x speedup on inferencing for Nemotron

- Multitoken prediction became available just a few weeks ago

- Compute gets only in a range were they can do a lot more and cheaper experiments (see Google IO 2026 announcement)

- World models are showing great progress and we do not know yet what they will bring to the table

- They are probably not finetuning/fixing all areas in parallel. I would argue that Anthropic focuses most of its efforts into coding and agentic. Google for sure does subagent and agentic optimizations too. Plenty of areas are just not touched i would say because they don't have the capacity

- We see more and more mulit modal models (these also consume compute)

- N-Gram paper and co i have not seen all of these things in chinese open models

- We don't even know yet what Meta is doing, but we do know they restarted their efforts again

- Anthropics models got a lot better benchmark wise for dening non sense asks. They do learn how to get rid or reduce hallucinations

- We are in the middle of the biggest Reinforcement loop whith all the training data we give them day to day and its not clear at all if they already use these models in thir training and at what stage.

- We do expect bigger models to be able to comprehend deeper concepts / broader code bases. Big companies with huge code bases probably are waiting for this

- Thre will be also continues progress in harnesses which in it alone is not part of the LLM progress (fair) but these harnesses do get better when you finetune a model to be optimized for a harness

- ChatGPTs Image model 2.0 got relevant better and came out just a month ago

I suspect, based on hardware requirements and progress on hardware infrastructure alone, that the industry wants to go to 100t models and we do not know yet what this will mean. I could see that we might skip normal transformer and find relevant other architectures.

Just a week ago there was a research paper about parallel input and output streams which has not been explored enough.

There was also a research paper were they showed that a LLM can compute things. This will take time to see were this leads to.

I don't think the focus on GRAM and facts is so relevant. Its about context and context handling not just some facts.

loading story #48318874

loading story #48314245

guluarte20 hours ago | root | parent | next

I think the future will be enterprise clients will train their own models based on their needs and data.

loading story #48314910

loading story #48314438

loading story #48315185

YetAnotherNick21 hours ago | root | parent | next

> It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.

I am ready to bet against this. Knowledge benchmark like SimpleQA isn't increasing for small models.

> It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.

Well for one, we know for certain there is Mythos which is meaningfully better. And I think there is a lot of juice left to squeeze for Mythos class model.

loading story #48312583

loading story #48312690

fnord7719 hours ago | root | parent | next

So, then I guess the big three are never going to make their money back.

wahnfrieden21 hours ago | root | parent | next

I would be shocked if 5.5 is the last new pre-train from OpenAI. Your comment is nonsense.

loading story #48313838

michaelchisari20 hours ago | root | parent | next

| a 60-90B model can outperform current SOTA

My conspiracy theory is that Apple recognizes this.

loading story #48313245

loading story #48315475

loading story #48314120

loading story #48313238

loading story #48320391

colin4k10249 hours ago | root | parent | next

[dead]

frankest17 hours ago | root | parent | next

[dead]

lichenwarp19 hours ago | root | parent

[flagged]

gen22021 hours ago | parent | next

I'm curious to poll HN on this issue. Do you feel like we've had meaningful/noticeable gains in terms of your programming workflows between 4.5 and 4.7?

My 2¢, I personally feel like all of the productivity gains since 4.5's release (in November 2025!!) have come from improvements to the harnesses (cc, cursor cli, codex, opencode, whatever) AND from the context window expansion from 200k to 1M.

But the actual "raw" intelligence of the model / ability to make good decisions feels like it has plateaued since 4.5. 4.6 was maybe a small improvement, but hard to differentiate from in-context-learning with the 1M window. 4.7 if anything felt like a regression in wisdom for me and my coworkers, with it consistently making worse/lazier decisions.

fittingopposite9 hours ago | root | parent | next

Yes. You and some random indigenous guy in the Amazon likely share the same intelligence but you are more capable because you have access to writing/reading, computer, car etc. Intelligence is more than raw intelligence. It's harness, skills, tools, memory etc. If you improve all the latter but keep the raw intelligence (LLM) fixed, you certainly get better results. Same with us humans.

loading story #48319477

Bnjoroge21 hours ago | root | parent | next

For long-running tasks, yes 4.7 has been a noticeable improvement. Goes off the rails alot less than 4.6 does. For shorter-sized windows, I havent felt as much and agree that the harness improvements have been fhe biggest lever

loading story #48313953

bonoboTP21 hours ago | root | parent | next

To me 4.5 was mindblow, 4.6 noticeable, 4.7 more like a style/personality change regarding how much it asks back, how much it assumes, how eager it is to jump to action etc but not really in terms of my perception of its smartness.

onlyrealcuzzo14 hours ago | root | parent | next

In my experience, 4.7 was a noticeable step down from 4.6.

I was one of these people that Claude would never finish anything and just randomly say, this is a good stopping point, I think you should go to bed.

And then I'd tell it to continue, and it would burn tons of tokens, make no progress and say, "This is a really good stopping point..."

Canceled and switched to Codex and have been pretty happy with it. It doesn't plan as well as Claude, but I think it does better implementation - and neither of them can actually come up with good plans without a ton of help...

Codex is also way faster.

somenameforme21 hours ago | root | parent | next

They all feel, more or less, the same to me in terms of output capabilities. Mostly get simple things right, can get more complex things right with nudging, eventually get stuck hard on something that takes a bunch of iterations through it/logging/etc or me fixing the code manually.

bcrosby9520 hours ago | root | parent | next

4.6 felt a bit better than 4.5 but slower. 4.7 doesn't feel better than 4.6.

giraffe_lady21 hours ago | root | parent | next

I actually don't see any personal productivity improvements from using opus over sonnet for coding. If you're keeping tasks small and conversations short, reading the code and correcting before changes go in, whatever advantages opus has aren't practically significant. It's also just talky as hell, overexplains anything it touches and every token produced this way increases the surface area for hallucination so you need to have your guard up even more with it.

There's a sweet spot of complexity for low importance tasks where it's just big enough I don't want to do it and just simple enough to have opus plan/delegate/review with another model. So possibly model improvements will grow this window, but currently I don't do much in there.

alfalfasprout18 hours ago | root | parent

I'm actually currently studying this :)

Honestly... not that dramatically. Each release is much more marginal. And quoted official benchmarks doesn't translate very well into the real world.

4.7 regressed hard in some ways. But a compounding factor too is that the claude code harness seems to nerf the model after a few months. Probably to reduce token use.

So far 4.8 seems less verbose but we'll see in practice what it translates into meaningfully.

gAI21 hours ago | parent | next

4.7 was the first time I had to resort to using the previous version (4.6) for most use cases. Hoping 4.8 rectifies this.

ishurand419 hours ago | root | parent | next

They just showed the benchmarks it improved on but it regressed on so much more, such as the MCRR benchmark: "On multi-round coreference/context recall tests (often cited as MRCR or long-text retrieval benchmarks), Opus 4.7 reportedly dropped from roughly 78.3% down to 32.2% compared to Opus 4.6."

merlindru21 hours ago | root | parent | next

Same. 4.7 felt like a definite regression

loading story #48312315

loading story #48316305

ruairidhwm17 hours ago | root | parent | next

I managed to find that Haiku outperformed Sonnet on some tasks...don't want to blog spam but if anyone is interested: https://www.ruairidh.dev/blog/sonnet-4-6-drops-format-rule-o...

petterroea21 hours ago | root | parent | next

Same. 4.7 has done some incredibly stupid things.

loading story #48314619

rhubarbtree21 hours ago | root | parent | next

Same. So happy when I found that option.

loading story #48312557

tanepiper18 hours ago | root | parent | next

Yep, until 1st June 4.6 is still x1 on Copilot, but will jump up quite a bit in coat - 4.7 was already highly priced, and the output was frankly terrible.

It still seems trying to build general models is mostly cost prohibitive - the frontier model provider and resellers are repricing in such a way the return on investment is dropping as developers and users become more cautious of burning their limits.

I'm still of the opinion that models like 4.6 don't need to be improved on - rather they need to be better integrated with more domain specific models in agentic flows.

dezsirazvan18 hours ago | root | parent

same!

mrandish19 hours ago | parent | next

I suspect the more frequent incremental releases may also be to deploy new capabilities used by Anthropic to control costs and throttle consumption of resources. I assume any new controls they expose to end-users have far more granular sub-controls under the hood which they can meta-adjust for each user type.

They mention more granular control of effort, 'dynamic workflows' and more speed controls ("fast mode"). While they position them as user features, they also sound like the kinds of knobs Anthropic will need to twiddle on the back-end to balance costs, margins, ARR, and user growth vs retention post-IPO to hit key metrics in quarterly reporting.

SkyPuncher21 hours ago | parent | next

> My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.

I've actually intentionally switched back to 4.5. I hated 4.7 so much that I decided to jump back all the way to 4.5.

Now that I've been using 4.5 for a few weeks, I find it significantly more reliable but a bit more forgetful than 4.6/4.7. I'm okay with that because it's really easy to identify this forgetfulness and nudge it.

I found 4.7's adaptive thinking to be extremely unreliable. It seems to overcorrect on the current message without considering the difficult of the overall problem. I wonder if 4.8 will improve on that.

michaelsalim17 hours ago | root | parent | next

Same here. Went back to 4.5 and was happy I did it. The only frustration was that I can tell the model has declined compared to the first few weeks it was released.

I also recently moved to 4.6 since I started hitting the context limit too often with my current project.

loading story #48317858

dwaltrip21 hours ago | root | parent

If you are using Claude code, just set effort to xhigh.

This one change will probably solve 80% of the problems you have noticed.

loading story #48312953

loading story #48314712

gertlabs21 hours ago | parent | next

4.5/4.6 were roughly the same in our testing. Opus 4.7 is smarter, but it's difficult to use as a product for various personality issues. So far, Opus 4.8 seems to be going down that path (unusably slow, but this could be a launch day rollout problem). Full Opus 4.8 tests are in progress now.

Data at https://gertlabs.com/rankings

__s20 hours ago | root | parent | next

"personality issues" I was able to tell that Opus 4.7 would take instructions more literally, which I appreciated once I calibrated my phrasing to be more precise (often asking to investigate issues, pre-4.7 it'd start making code changes instead of just giving write up). But I can see contexts where handling vague prompts would've just been worse

swingboy13 hours ago | root | parent

Looking forward to the results. Thanks for your work.

loading story #48318854

permute8 hours ago | parent | next

I am using Claude Code for formal verification with Lean. In my personal experience both Opus 4.7 and now what I see from first experiments with Opus 4.8 were big improvements. I was able to delegate proofs of larger theorems that their predecessors could not handle.

light_triad21 hours ago | parent | next

I've been using Claude Code regularly since the 4.5 release, and 4.7 was a significant regression: very unreliable, arguing about changes, deciding that fixes weren't needed, etc.

I'm hoping they recreate the magic of 4.5 but it's as much about the quality of harness, the memory and efficiency of the tools than simply the models at this point.

gandalfthepink8 hours ago | parent | next

May be my tasks are rudimentary but the results I get with the 4.5 model are just the same as 4.7 or 4.6. it's just at the advanced models consume more tokens and and are actually loss making for my work. The incremental changes that they are making are not really that valuable. In fact I have found that even glm 5.1 is giving me something equivalent to what Opus 4.6 gives. Am I missing something that everyone else is cheering for in these small incremental model releases?

andersmurphy7 hours ago | root | parent

I wonder if it's being done to improve revenue nunbers without changing an enterprise contract? Oh what's that your token usage went up because some of your developers switched to a new model? That sounds like a you problem.

I thinks there's a big push to get these companies in a state where they can be dumped on public markets.

ricardobeat21 hours ago | parent | next

4.7 was a significant jump in the ability to run long-horizon tasks. It immediately completed tasks that 4.6 was unable to, even though I have the impression that it became a bit less capable over the first few weeks after release.

It also seems to be helpless at effort levels < xhigh, I turn to Sonnet when simpler tasks are needed.

viking1238 hours ago | root | parent

It didn't do shit

WhitneyLand20 hours ago | parent | next

“Maybe my own tastes are saturated now”

It might be saturated for smaller scopes of work, but it’s not hard to see the cracks when you scale up what you ask of SOTA models/agents.

One example, to try and single shot prompt coding a ChatGPT equivalent chatbot.

Sure it will spit something out, but the feature depth, UX subtitles, backend integration, and lots of pragmatic engineering decisions along the way will just not be baked.

Another example is building a C compiler from scratch which Anthropic showed is still a struggle to do.

Not that these these specific examples are important but just to point out scaling up expectations shows the cracks.

It’s not just a model problem of course, better agents, orchestration features (like Dynamic Workflows mentioned in the post), all need to continue to evolve.

Ar what point does my CS degree become totally useless is an open question.

hypfer18 hours ago | root | parent

> At what point does my CS degree become totally useless is an open question.

Why are you people saying all these things.

We'll probably see long-distance space travel long before a degree in generic problem identification and solving becomes totally useless.

loading story #48318347

ahmadyan20 hours ago | parent | next

pretty spot on.

In my experience, Opus 4.0 was fantastic, major jump from 3.7. it was creative, super slow and expensive, and would sometime forget what it was doing, but it was getting the job done.

4.1 they made it much faster, so a lot of infra improvements.

4.5 was the time it could work on longer task, didn't make a lot of obvious mistakes of 4.0, and i think this was about the time the opus went mainstream, and all of the anthropic's compute crisis began, so instead of making the model better they tried to optimize it to reduce cost instead.

4.6 was such a bad model, they switched to adaptive thinking and it had so many bugs. poor api design, benchmaxxed and poor real-world results. i switched back to 4.5.

4.7 they just fixed the bugs they added in 4.6. Better than 4.5.

haven't fully tested 4.8 yet.

teruakohatu19 hours ago | root | parent

I gave 4.6 a miss and only recently switched from 4.5 to 4.7. I found on a particularly different task 4.5 struggled with (getting stuck in loops and trying to convince me the problem had been solved) was quite solvable with 4.7.

willtemperley8 hours ago | parent | next

I'm here to complain about the churn.

I feel like I get to know a model in the human sense of understanding a personality. Yesterday I knew 4.6 extended, today it's different, there's multiple "token budget" levels. I just want 4.6 extended back as it was, I was getting on well with it / them.

lionkor8 hours ago | root | parent

Humanizing this technology seems like a step in the wrong direction.

loading story #48319874

theptip18 hours ago | parent | next

My read - 4.7 was a tactical lobotomy to improve the average experience at the expense of peak performance; necessary due to compute pressure.

Now that they have Colossus capacity, I guess they can tune up the intelligence again and spend more tokens on reasoning budgets.

4.7 was definitely a lot more flaky for me vs. 4.6 before the reasoning bugs.

binary001021 hours ago | parent | next

Maybe try making a simple randomize script to swap the three latest models. And see if you can tell which ones are meaningfully different without knowing which ones are flipped on or off?

osigurdson21 hours ago | root | parent

I find the quality ebbs and flows even on the same model. My guess it is something to do with GPU availability but only guessing.

loading story #48312868

irthomasthomas21 hours ago | parent | next

Given that 4.7 was a brand new model, trained from scratch with a unique architecture and tokenization scheme, I don't see the same pattern. It seems arbitrary.

dominotw21 hours ago | root | parent

i dont understand the nuances here. what does this mean. 4.8 is trained on same model as previous one then? what does brand new mean.

loading story #48312696

extr21 hours ago | parent | next

IMO they have all been clean and noticeable upgrades over their predecessors. Opus 4.7 in particular was a solid jump in capabilities.

NiloCK21 hours ago | root | parent | next

I think it's telling how split the opinions are around all of this. A lot of people distinctly disliked 4.7.

Are the dividing lines around personality? Working domains? Opinionated software stuff?

Who knows?

TSiege21 hours ago | root | parent | next

most of my coworkers feel the opposite about 4.7 and that 4.6 was, to them, significantly better to point that several stopped using claude code

loading story #48314556

viking1238 hours ago | root | parent

It didn't change at all, same as 4.6. Good morning to the Anthropic office btw.

loading story #48320438

spaceman_202020 hours ago | parent | next

I think 4.7 was an awful model in actual use. I never got anything out of it and it was frustratingly weird. This feels more like an attempt to course correct and isn't a real bump

throwaway6346719 hours ago | root | parent

I think they overtrained on scientific papers or such as it would spout really sophisticated sounding nonsense with a ton of complicated verbs and adjectives. 4.6 was definitely better in that regard. The more I use these tools the more I think they’re not actually that revolutionary. I mean it’s still amazing what they can do but they have very clear limitations it seems.

root-parent15 hours ago | parent | next

ChatGPT 5.5 is consistently the much better model and by a large margin.

How do I know? Because when pushing both to generate code or in independent chats to analyze projects, 5.5 will consistently find all the bugs that Claude does not find, and when challenged, Claude does agree those bugs were there. And my findings match those.

When from a blank start asking Claude to analyze project A and Project B,. Clause will consistently say project B is the better structured, more robust, and more defect free and does justify it. And project B was the one created by GPT 5.5....And also the one I judge to be the best one.

And yes, both at deep effort settings and starting from same specs...

viking1238 hours ago | root | parent

5.5 is much better than any Anthropic model. I hate both companies with passion but the Anthropic shills here are in overdrive mode. On top of it, it's cheaper.

Greetings to the Anthropic office good sirs btw.

nfw214 hours ago | parent | next

I think the issue with legibility comes down to the fact that most users are not using LLMs for tasks where improvements to raw reasoning abilities wouldn't help much or at all. So it's not a matter of anyone's deficiency of perception but rather a lack of any benchmark to perceive.

It's kind of like how the consumer laptop market is now. I was telling my boss today that most employees wouldn't see any noticeable performance difference between a macbook pro and a neo if they are just doing admin stuff on the web.

ThunderBee18 hours ago | parent | next

IME the most noticeable performance boosts are in complex multi-agent workflows.

EX. You call an orchestration agent and define an implementation plan with the help of a number of sub agents planning out different features. You and the lead agent review all of the plans and send them off to a set of agents that write tests which get send back to the orchestrator then passed along with the plan to a set of coding agents who implement the features in their own worktrees. That gets passed back to the orchestrator which hands it off to another set of agents doing the code review and merging the features before sending it back to you.

8note18 hours ago | root | parent

i dont think theres anything particularly special about new models for that though. thats a harness improvement

loading story #48318082

jimbokun19 hours ago | parent | next

How long would it take to evaluate a new coworker to say “wow she’s really bright?” Relative to your other coworkers?

A few days? A few weeks? Longer?

However a company releases a new AI model and within hours users are confidently proclaiming how much smarter it is than previous versions.

byzantinegene11 hours ago | root | parent

alot of investor money is hinging on models performing better every release.

cootsnuck19 hours ago | parent | next

Well, it seems like collectively we are all struggling to perceive model progress, given that it seems like every reply to you is reporting different experiences with which of the models has subjectively performed best for them.

bwhiting23568 hours ago | parent | next

the churn is... a version bump to the same api? If you want to compare you can write some evals.

j_m_b12 hours ago | parent | next

We're at the top of the S-curve and you're romanticizing diminishing returns with vague hints of super human capabilities and singularities.

onlypassingthru21 hours ago | parent | next

The honesty will be noticeable. Maybe we'll see some honest assessments like "That is not possible within the laws of known physics", "Your legal argument is nonsensical and defies logic", "There is no evidence to support taking that will cure anything", etc., etc.

ifwinterco19 hours ago | parent | next

4.7 uses more tokens and costs more for the same task than OG 4.5, that's about it

mgraczyk13 hours ago | parent | next

dangerous thing to believe IMO The models will get better, you will notice, everyone will notice. They will get better at coding and everything else. You should plan around that.

hypfer18 hours ago | parent | next

> (it's smarter than me?)

I genuinely hope that you're joking with that statement.

Or this is a bot.

Or an ARG.

Or Art.

Help.

okamiueru18 hours ago | root | parent

If LLMs have tough me anything, is that the average person is far more gullible than what I could have imagined.

loading story #48315416

fl0id15 hours ago | parent | next

tbh, the last 2-3 version bumps, main change has been that they take longer, and cost more/have more usage restrictions. (combined with new tooling, which eats a ton of tokens)

taurath12 hours ago | parent | next

> I'll never again perceive model progress

If the hype train keeps going for another year, Sam and co will have to resort to direct gaslighting like saying the model is improving but nobody can feel it anymore, oh and I need 10 trillion dollars

iLoveOncall18 hours ago | parent | next

I'm pretty sure they're releasing 4.8 because they massively shit the bed with 4.7 and people aren't using it.

I have ONLY heard negative feedback about it, and trying it myself also yielded really awful results.

20 hours ago | parent | next

{"deleted":true,"id":48313797,"parent":48311998,"time":1779994871,"type":"comment"}

jere19 hours ago | parent | next

"it's smarter than me?"

You don't have to correct it dozens of times a day!? Really?

Grimblewald15 hours ago | parent | next

I maintian a log of tasks, prompts, related information etc. So i can repeat past workflows verbatim, and I can qualitatively say each model beyond 4.5 has been a regression, and it would not surprise me 4.8 continues the trend. Each iteration has failed at more tasks previously completed succesfully. Right now it flat out refuses to answer many benign chemistry questions, or leans into shilling to hard and ignores non industry funded studies on certain topics. I'm transitioning to deepseek as a reuslt. Cheaper by far and at this stage not strictly speaking less capable.

mrinterweb16 hours ago | parent | next

The more difficult it is for humans to consistently and accurately compare model outputs the more opportunity there is to spread FUD (Fear, Uncertainty, Doubt). Considering valuations of these companies and the astronomical investments being made, a sabotage campaign with bots or paid users on reddit, twitter, YouTube, or whatever socials could go a long way towards knocking market cap off the competition. Not saying that's happening, just saying its an obvious target. Even if the goal is not nefarious, people with a perceived bad experience are 2-3x more likely to complain. So even without bad actors involved, a new model may need to be significantly better in order to break even on the old net promoter score.

gigatexal20 hours ago | parent | next

why are the models the same price?

https://platform.claude.com/docs/en/about-claude/pricing

``` Model Base Input Tokens 5m Cache Writes 1h Cache Writes Cache Hits & Refreshes Output Tokens

Claude Opus 4.8 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok

Claude Opus 4.7 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok

Claude Opus 4.6 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok

Claude Opus 4.5 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok

Claude Opus 4.1 $15 / MTok $18.75 / MTok $30 / MTok $1.50 / MTok $75 / MTok

Claude Opus 4 (deprecated) $15 / MTok $18.75 / MTok $30 / MTok $1.50 / MTok $75 / MTok

Claude Sonnet 4.6 $3 / MTok $3.75 / MTok $6 / MTok $0.30 / MTok $15 / MTok

Claude Sonnet 4.5 $3 / MTok $3.75 / MTok $6 / MTok $0.30 / MTok $15 / MTok

Claude Sonnet 4 (deprecated) $3 / MTok $3.75 / MTok $6 / MTok $0.30 / MTok $15 / MTok

Claude Haiku 4.5 $1 / MTok $1.25 / MTok $2 / MTok $0.10 / MTok $5 / MTok

Claude Haiku 3.5 (retired, except on Bedrock and Vertex AI) $0.80 / MTok $1 / MTok $1.60 / MTok $0.08 / MTok $4 / MTok ```

teruakohatu19 hours ago | root | parent | next

Why shouldn’t they be? They are probably the same size and cost the same to run. They are not doing full training runs (eg Mythos) so don’t need to recover insane training costs.

loading story #48314700

staticman219 hours ago | root | parent | next

Opus 4.7 and presumably 4.8 are more expensive due to a new tokenizer that translates data into more tokens per input.

nikcub16 hours ago | root | parent

Same price on a token basis, but usually steadily decreasing on a task basis

loading story #48318053

taytus21 hours ago | parent | next

Incremental gains compounds.

itake21 hours ago | root | parent | next

meta threw in the towel when it came to producing AI models since their gains couldn't keep up with China.

loading story #48315323

loading story #48312851

paulddraper21 hours ago | root | parent

Exactly. Go back to Opus 4.5 and see how you like it.

You won't, really.

conartist621 hours ago | parent | next

Just want to say there's no question that you're smarter than any (and every) AI.

NiloCK21 hours ago | root | parent | next

I appreciate the generosity, but you're gonna want to meet me first.

loading story #48312945

petesergeant21 hours ago | root | parent

No question at all that a dolphin swims better than a submarine.

vasco14 hours ago | parent | next

I can tell from hearing Feynman recordings that he was smarter than my own university's physics professor, but both were smarter than me.

overgard16 hours ago | parent | next

It's almost like they used up most of the benefits of scaling and the fundamental issues that people have been talking about with LLMs for years are real.

avador17 hours ago | parent | next

The inability to tell if a model is improving is, I think, a tell that the model has improved up to your level of programmatic (analytic, computational) capacity.

A lot of the information (blogs, tweelches, plosts) that I consume seems to be converging on the idea that we all depend on the models. However. It seems to me that the exact opposite is true. The models depend on us, and _desperately_ so.

There must have been stories, books, movies, made about this intellectual (and propositional, legal, factual) inversion.

The majority need the minority. Has always been the case, I now think. But what has newly developed is that the majority can take a dependency not on the minority, but on a select few companies who are abstracting and compressing the minority into latent spaces.

adi_kurian13 hours ago | root | parent

Or the model could just be shite.

8note18 hours ago | parent | next

honestly sonnet 3.7 is still good enough for me, as long as whatever tool prompts and so on are well optimized enough between harness and model.

i still havent really noticed it per set being better

ElkeQin11 hours ago | parent | next

[flagged]

19 hours ago | parent | next

{"dead":true,"deleted":true,"id":48314116,"parent":48311998,"time":1779996191,"type":"comment"}

ckarani8 hours ago | parent | next

[dead]

rotcev19 hours ago | parent | next

[flagged]

Imustaskforhelp20 hours ago | parent

Although I am not sure about it but there was something I read which said that models intentionally degrade slowly by lower quantizations as a new model is going to drop.

This felt particularly visible during the 4.6 when people said that 4.6 felt dumber and I remember someone doing some analysis and it sort of proved that models were getting dumber over time.

This has both benefits of costing less for the company to run while taking a standard subscription but also, at the same time, making the next model when it drops to public to "feel" more good comparatively.

Again, I am not sure if this is the case or not but merely proposing something that I feel like it might be in the possibility of realm.

loading story #48323207

senko20 hours ago | parent | next

My fav coding benchmark for frontier models is to build a simple RTS game in one file (js/html/css). Claude Code with Opus 4.8 in ultracode mode nailed it, the best result so far:

https://bsky.app/profile/senko.net/post/3mmwnrkwboc2v

The prompt was: Create a simple but functional real time strategy (RTS) game similar to old WarCraft, StarCraft or Command & Conquer games. The player should be able to build buildings, create units, gather resources and should uncover the whole map. No AI or multiplayer needed. Use simple but nice-looking graphics. No sound. Implement everything in HTML/CSS/JS, everything in a single file (you can use 3rd-party js or css libraries/frameworks via CDN).

egeozcan10 hours ago | parent | next

I've been tasking LLMs to write a traditional AI for a full vibe-coded RTS. I remove the human players and let them battle. I don't know why but I enjoy watching AI players battle so much :)

In the repo, I even have a tournament script that calculates ELOs. So far, codex was unmatched. I'll try with Opus 4.8 too.

https://egeozcan.github.io/unnamed_rts/game/

https://github.com/egeozcan/unnamed_rts/blob/main/src/script...

calebgcc8 hours ago | parent | next

I wonder if your previous prompts were part of the new RL fine tuning, and that’s why is now better at this specific question

skolos10 hours ago | parent | next

How many times did you try? Same model running multiple times can produce both very good and very bad results. In my benchmark even 10 runs often not enough to tell for sure if one model is better than another.

jclay20 hours ago | parent | next

It almost appears as if the code was minified. The variable names are short and formatting looks like it's written to minimize whitespace. Did it write it in this compact format all on it's own?

senko18 hours ago | root | parent | next

Yeah looks extremely compact. I didn't instruct it or told it to use as few lines of code or characters or nothing of the sort.

Not sure why it did that. Its own rationale (which is highly suspect, but the only lead I have) is that it defaults to dense style if it has to write a file in a single go. May be a kernel of truth somewhere in there.

loading story #48320095

loading story #48316392

andai18 hours ago | root | parent | next

A friend sent me something he vibe coded which included a massive webassembly blob in the HTML file. My friend is not a programmer so he was not able to explain to me how it did that.

loading story #48316534

syspec13 hours ago | root | parent | next

I just had Opis 4.8 code up something and actually that's exactly how it coded it!

It looked gross and minimized, the result was awesome but the code looked pretty awful visually

rphv11 hours ago | root | parent

"Readability by humans" may no longer be as important as it once was.

loading story #48319761

RobinL9 hours ago | parent | next

Nice, I recently found something like this was possible too. Gpt-5.5 one shotted the basic game, but then I added some ai generated graphics/sounds/music and asked it to write then up.

It's a vocab building game, playable here (desktop only): https://rupertlinacre.com/vocab_annihilation/

It kind of blows my mind I can go from: 'I want a fun way to help him learn vocabulary, and I loved total annihilation as a kid' to 'heres a game that's he finds genuinely fun that helps him learn something ' in a few prompts.

apitman19 hours ago | parent | next

I like that benchmark. You should throw the results up on GitHub pages so people can try out the games.

brandly18 hours ago | root | parent

Yeah! Host on GitHub pages, so it's easy to click a link and play!

loading story #48315395

zuzululu7 hours ago | parent | next

some reason that website is showing up as high risk and i cannot view it , I had to open it from my mobile phone.

it looks quite impressive, I don't use claude currently but hearing good things about it...from codex users ironically

loading story #48320602

H3X_K1TT3N16 hours ago | parent | next

Thanks for also sharing the prompt. I've been testing claude by asking it to make similar things, so it's useful to see what other people are doing.

I do find it interesting that the visual style is pretty similar to things it's produced for me.

dash210 hours ago | root | parent

If you look on the page of games, the style of chatgpt 5.5 is almost identical to the Claude style.

jmtame14 hours ago | parent | next

Wow, that's impressive. Had fun playing it for 10 minutes locally. Found myself wanting to discover an enemy base :)

digdugdirk19 hours ago | parent | next

Do you have a collection of these benchmark apps saved anywhere? I'd be particularly interested in seeing the relative cost differences between different models in a use case like this.

senko18 hours ago | root | parent

I'm saving them all as gists here: https://gist.github.com/senko

But I just vibe-coded a handy list of all the tests I did (unfortunately without the commentary I usually leave in social media posts -- I should add those at some point): https://senko.net/vibecode-bench/

jryan4919 hours ago | parent | next

Kinda buggy, but impressively nonetheless. How long did it take?

senko18 hours ago | root | parent

It took 50 minutes, would be ~$20 in API costs (I'm on a Pro sub).

elAhmo19 hours ago | parent | next

What is ultracode mode?

senko18 hours ago | root | parent | next

It's a combination of reasoning effort (max) + enabling workflow that orchestrates multiple sub-agents.

After some interrogation, here's how it organized the work:

1. Design workflow (rts-game-design, 11 agents, ~13 min) ran first, produced SPEC.md + DESIGN.md:

1.1. Proposals (3 parallel agents): each designed a complete RTS from a different philosophy

1.2 Judge (1 agent): evaluated all three and synthesized one unified design, committing to specific numbers (costs, HP, map size, etc.).

1.3 Deep-dives (6 parallel agents): each wrote an implementation-ready spec for one subsystem, all consistent with the chosen design

1.4 Synthesis (1 agent): merged the design + all six subsystem specs into one conflict-free master spec

2. Code-review workflow (rts-code-review, 25 agents, ~5 min), ran after the main agent had written and tested the code:

2.1 Review (6 agents, read-only Explore type): each scrutinized one dimension and returned structured findings.

2.2. Verify (19 agents): every finding got its own skeptic agent told to try to refute it, Result: 19 flagged → 16 confirmed, 3 rejected as non-bugs.

What the main agent did in the main loop:

- Wrote all ~2,400 lines of index.html by hand from the spec.

- All browser testing/debugging via headless Chrome (I told it to use rodney by @simonw, love the tool :)

- Applied all 16 fixes from the review and re-verified them in the browser.

loading story #48316735

loading story #48317542

tcoff9119 hours ago | root | parent | next

it's a brand new mode

colechristensen17 hours ago | root | parent

Biases the model to solve problems with teams of agents

fireant11 hours ago | parent | next

Wow that looks really impressive. Both the UI and the content looks good, the game is a bit buggy but still nice!

ammar_x16 hours ago | parent | next

Is there some sort of a leaderboard for this test? Like if you'd give each of Opus 4.8 and GPT 5.5 a score out of 100, what would the scores be?

senko16 hours ago | root | parent

There isn't, as I wasn't going for strictness, more like a playful challenge in the vein of Simon's SVG pelican.

Between the two, Opus 4.8 seems more capable. But, I suspect the harness plays a large role here. It's possible the result would be as good if Codex ran 10+ agents and spent an hour on it.

OpenAI and Anthropic usually fast-follow each other, so I wouldn't be surprised if Codex got the same capability in a couple of days (and even an update to the model), then it'll be a better test.

Sooo, let's say, winging it, vibes-based: 85% for Opus 4.8, 75% for GPT 5.5. Compare with GPT 5.3 (let's say 25%) here: https://senko.net/vibecode-bench/2026/rts-codex-5.3.html

Madmallard13 hours ago | parent | next

Okay now have it implement an authoritative server with reliable netcode and reconnection/disconnection logic, lobbies, and finding games, in-game chat, synchronized state around starting and ending games, resignations and such

shlewis14 hours ago | parent | next

How much did it cost?

l3x4ur1n19 hours ago | parent | next

Played it to the end. Pretty neat!

veqq10 hours ago | parent

wow

colonCapitalDee22 hours ago | parent | next

"Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor."

This is a refreshing attitude!

I've also verified that you can now turn off adaptive thinking in the web UI, which is great. I've had a lot of problems with thinking not triggering and the model producing sub-par output. Glad we can finally turn it off. (I hope being able to turn off adaptive thinking is new, if I could have turned it off at any time that would be embarrassing)

gibspaulding19 hours ago | parent | next

I’m pretty sure that switch has always been there, but turning it off doesn’t do what you want. It disables thinking entirely.

kakugawa18 hours ago | root | parent

Opus 4.7 does not support disabling adaptive thinking (web, Claude Code). [1] Like the OP, I experienced similar issues and I'm glad that they brought back the ability to disable adaptive thinking in Opus 4.8.

[1] https://code.claude.com/docs/en/model-config#adaptive-reason...

> Opus 4.7 and later always use adaptive reasoning. The fixed thinking budget mode and `CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING` do not apply to them.

loading story #48315183

ddp2617 hours ago | parent | next

It is refreshing but perhaps actually not warranted this time?

I mostly study web research, and Opus 4.7 was a regression on BrowseComp compared to Opus 4.6, which has been born out by my usage.

Opus 4.8 is now much better than either 4.7 or 4.6, and having it search the web is one of the primary use cases of chatbots.

winwang21 hours ago | parent | next

Awesome, thanks for posting because I think I hit a possibly-spurious bug in turning Adaptive off when I switched models (4.6 -> 4.8, extra). Tried again, works as intended (I hope).

More importantly for me, though, is how CC will respond to 4.6-"only" flags for thinking. For now, it doesn't seem to clobber my setup.

mkozlows17 hours ago | parent | next

I was hoping that the web UI would be better -- I like Anthropic better than OpenAI from a values perspective and want to use their products, but ChatGPT in thinking mode has been just vastly better than claude.ai.So my fingers were crossed that these changes would bring it up to par.

But trying it out... alas, no. Simple factual questions where ChatGPT would go do a quick search and get the facts and report them back to me, get a "Great question! [totally invented bullshit]" from Claude, even with this new model and thinking set to high. I have to explicitly tell it to search to get it to look up basic facts, rather than it recognizing that it needs to do that, like GPT does.

Paracompact16 hours ago | root | parent

What are some examples?

elSidCampeador18 hours ago | parent | next

Are they doing these smaller releases to attune users to a more incremental cycle of updates? Like, yeah other model providers do these major updates every x months, we on the other hand do incremental updates every x/2 months

jascha_eng21 hours ago | parent | next

The benchmark improvements actually look pretty damn nice tho!

smartmic20 hours ago | parent | next

> This is a refreshing attitude!

Well, I think the attitude is that costs are allowed to escalate faster and more steeply than the features delivered. From that perspective, semantic versioning is a handy tool for adjusting pricing strategies. IMHO, it (versioning) only makes sense for open-source projects, where you can clearly see the actual changes made with each version upgrade. Anything else is more than a little suspicious…

drewnick20 hours ago | root | parent | next

While all these models are nondeterministic a feature bump is still necessary as the same input can have wildly different output on a new model. For API users being able to pin a model is a necessity.

smsx20 hours ago | root | parent | next

The 4.8 model costs the same as it's 4.7 predecessor.

loading story #48315941

zaptheimpaler20 hours ago | root | parent

All the 4.x models are still available, and they all cost the same.

loading story #48314948

comboy19 hours ago | parent | next

"We've cut our costs A LOT"

empath7513 hours ago | parent | next

I was working with opus 4.7 on a math formalization problem for several days and 4.8 one-shotted the proof from a clean description as soon as the update came through. I was very surprised.

21 hours ago | parent | next

{"deleted":true,"id":48312939,"parent":48311843,"time":1779991359,"type":"comment"}

casey211 hours ago | parent | next

You act like they weren't fearmongering about Mythos literally 2 months ago. Do you think everyone is stupid, we know exactly what you are doing. Please.

wahnfrieden21 hours ago | parent | next

What's refreshing about it given the context that 4.7 was a regression in many ways (including as measured by benchmarks)?

4.8 is also 2x more expensive for a "modest" performance bump. How refreshing.

This is just cope.

cootsnuck18 hours ago | root | parent | next

> 4.8 is also 2x more expensive for a "modest" performance bump. How refreshing.

Where are you seeing it's 2x more expensive? https://platform.claude.com/docs/en/about-claude/pricing

loading story #48315130

murkt18 hours ago | root | parent

Price hasn’t changes at all, though.

loading story #48317412

FergusArgyll21 hours ago | parent | next

I liked the "modest but tangible improvement" too! There is a cynical take here but I think I'm gonna hold it in...

ai_slop_hater20 hours ago | parent

What do you mean? This is not just a new model, this is a new way of thinking.

northern-lights22 hours ago | parent | next

> Not only that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work. Models of this capability level require stronger cyber safeguards before they can be generally released. We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks.

Probably more interesting than the 4.8 release.

zamalek16 hours ago | parent | next

> Probably more interesting

It is widely suspected that self-inflicted "bad news" ("Mythos is so dangerous we just can't give the public access to it") is nothing more than Dario's typical style of marketing - keep in mind that they have an IPO coming up, because he certainly factors that into everything he says in public (as is his responsibility, to be fair).

An alternative reason for delaying the model might not be "we are trying to make it safe." It could be "we don't know how to host this thing at scale, or cost-effectively".

GPT 5.5 has already been shown to be as adept as Mythos at finding vulnerabilities.

Finally, laymen massively underestimate the importance of the harness for model performance. OpenHands existed long before Claude Code, Claude Code changed everything because of the clever hand-holding it does. Mythos is definitely more than just a model.

clbrmbr12 hours ago | root | parent | next

One capability that I see is missing from opus is this ability to understand an entire system. My hope is that a mythos class model will be able to comprehend even something as complicated as an IOT system with a hardware and firmware layer multiple API’s backend and different kinds of API and web clients.

The main limitation we’ve had to agentic coding is an understanding of this system that spans processes running on different machines and architectures.

loading story #48320356

LPisGood15 hours ago | root | parent

What sort of clever handholding does Claude code do?

loading story #48317486

andai18 hours ago | parent | next

In the Opus 4.7 release notes they mentioned intentionally making it worse at cybersecurity. [0]

This suggests that they're doing the same thing with Mythos now and the Mythos we get will be nerfed in that department?

Or more precisely, I think they'll have two versions of Mythos, and the scary one will probably continue to require a lot of paperwork.

https://www.anthropic.com/news/claude-opus-4-7

scuderiaseb18 hours ago | parent | next

So this is how they’ll remove access from Claude Pro to the biggest models. You would need at least a Claude Max subscription for the bigger than Opus models I bet.

F7F7F718 hours ago | root | parent | next

Anthropic's wants to sell us Claude Code with no model selection at all.

Opus seems to be overly eager of late to 'vibe' out entire solutions and build out things that you didn't ask for.

/goals is helping set the narrative that does it really matter if Sonnet and 3 Haiku agents got you to that end state...eventually...if its what you asked for?

For better or worse Opus is already handing off 80% of its work to background agents of Sonnet, Haiku, and likely a quantized Opus.

Want model selection? Pay for the API.

loading story #48315542

swalsh16 hours ago | root | parent | next

Its amazing how quickly ive just become accustomed to being a max subscriber. I dont think I could go back to pro.

loading story #48319157

selcuka14 hours ago | root | parent

They have already been experimenting with such ideas [1]:

> Claude Code Removed from $20-a-Month "Pro" Subscription for New Users

[1] https://news.ycombinator.com/item?id=47855832

ac2921 hours ago | parent | next

More interesting than that to me is "we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost"

Sonnet and Haiku look real outclassed for the price with current Chinese competition.

TIPSIO21 hours ago | parent | next

Seems like they might be hinting that if you are not a billionaire or multi-billion dollar company you will just get a limited and nerfed Claude Code slash command /mythos-security-audit or something.

Hope this isn’t the case and that normal average Joe’s of the world don’t get policed out of access.

gs1720 hours ago | root | parent | next

> you will just get a limited and nerfed Claude Code slash command /mythos-security-audit or something.

Unless it's so expensive that we can't realistically use it for anything, I wouldn't complain about getting at least that. I would also rather have the actual model, but that's a useful application of it (and I'm probably not going to afford using it for much more).

loading story #48313637

loading story #48313628

loading story #48313633

hedora21 hours ago | root | parent | next

Isn't OpenAI's public flagship already beating Mythos on penetration testing? I get the impression Mythos is just valuation-juicing for IPO more than anything else.

The fact that they haven't released it yet suggests a cost/margins issue to me more than anything else. Short term, I'll probably keep using Antrhopic, but my long-term bet is that locally-served models win, if only because the quest for profitability will probably lead to intentionally-nerfed / enshittified frontier models.

At other vendors, ad placement within LLM responses is either coming or already here. Anthropic's handling of OpenClaw shows they're willing to engage in anti-competitive behavior, and the courts are not in a hurry to stop them. Why would I pay them $200 a month for such treatment when a $2K box does what I need locally?

loading story #48314698

loading story #48314126

loading story #48315712

dbbk18 hours ago | root | parent | next

What does an average Joe need a Mythos level model for that Opus can't do for them?

loading story #48315353

loading story #48314928

Tepix21 hours ago | root | parent | next

It does sound like an even higher API price tier for sure.

kdmtctl18 hours ago | root | parent

This command would be not so bad for not a billionaire me.

21 hours ago | parent | next

{"deleted":true,"id":48312103,"parent":48311816,"time":1779988283,"type":"comment"}

huflungdung21 hours ago | parent

[dead]

simonw22 hours ago | parent | next

I generated pelicans riding bicycles on both thinking level low and thinking level high:

https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...

The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.

For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...

keyle14 hours ago | parent | next

It's pretty safe to say that AI will be used on the battlefield making real life and death decisions before it will be able to render a decent pelican on a bike in SVG.

loading story #48320606

culi12 hours ago | root | parent | next

It already has been and this has been widely written about. AI was used to identify and prioritize targets for the US to bomb in Iran.

Here's an article from 2 months ago for example: https://www.theguardian.com/technology/commentisfree/2026/ma...

It was also implicated in the bombing of a girls elementary school which left 168 dead. The US did a "triple tap" to kill any first responders.

https://www.theguardian.com/news/2026/mar/26/ai-got-the-blam...

https://www.theguardian.com/technology/2026/apr/01/dont-blam...

loading story #48318711

notatoad13 hours ago | root | parent | next

the battlefield sounds much easier. worst case scenario you kill somebody, but that's what you're trying to do anyways.

if you kill somebody while trying to render a pelican on a bicycle it's a real problem.

loading story #48319284

ares62311 hours ago | root | parent

I think that's a fair tradeoff. There's no way I'm going back to writing code by hand again. No one deserves that.

loading story #48318935

GistNoesis20 hours ago | parent | next

> the bicycle frame is the correct shape

No, the handlebar is wrong. The handle bar is rotating the frame instead of rotating the front wheel. The handle bar should be mounted on the same line as the front wheel is.

Hopefully 4.9 will read my comments :)

loeg20 hours ago | root | parent | next

Could be an extremely high angle stem that just happens to match the downtube angle.

Venkatesh1016 hours ago | root | parent

Maybe the pelican is just riding a road bike/gravel bike

eminence3218 hours ago | parent | next

I bet someone shares this link every time you post about bicycles, but since I didn't see anyone share it yet in this thread, I'll take the opportunity to do so:

https://www.gianlucagimini.it/portfolio-item/velocipedia/

Turns out even humans can be pretty bad at drawing bicycles :)

walthamstow17 hours ago | root | parent | next

On a new model release, you can guarantee two things are in the replies to Simon. One is your link, the other is "surely the models are being trained on this now"

saghm15 hours ago | root | parent | next

Sure, but no one is trying to force art from most people into about every area in the economy where anyone ever pays for something visual. If you asked professional artists to draw a realistic bicycle, I'm guessing few of them would try to just randomly guess what the mechanical parts looked like

kvirani16 hours ago | root | parent | next

> The most unintelligible drawing has also the most unintelligible handwriting. It was made by a doctor.

Haha

skydhash18 hours ago | root | parent

But if you need to draw a bicycle, you wouldn’t pick a random person in the street. You would hire an artist and you’d be guaranteed to have at least a believable one if not a perfect rendering.

No guarantees is why LLM is akin to gambling. Every new context is essentially picking someone out of the crowd.

jonas2121 hours ago | parent | next

Glad to see that the "high thinking" level adds a helmet. Always a smart choice.

usef-16 hours ago | root | parent

And yet some people doubt Anthropic's commitment to AI safety

simonw19 hours ago | parent | next

Here's pelicans in all of the thinking levels - low, medium, high, xhigh, max

https://tools.simonwillison.net/markdown-svg-renderer#url=ht...

15 hours ago | root | parent | next

{"deleted":true,"id":48317223,"parent":48314391,"time":1780012860,"type":"comment"}

ionwake17 hours ago | root | parent | next

I like the way the max pelican has a stern look on his face

stratos12319 hours ago | root | parent | next

Is the output on the max level meant to be missing?

loading story #48314506

motza15 hours ago | root | parent

low: yolo

medium: redesign bike so peli can reach bars

high: redesign bike so peli can rest on frame

xhigh: yolo

max: big peli reach bars

spmartin82321 hours ago | parent | next

You've peed in the pool Simon, this has to be a part of the internal evals by now! You got to try something new - maybe a panda in a canoe?

phainopepla221 hours ago | root | parent | next

If these were in the internal evals then the output would be much better. The 4.8 pelicans are pretty meh

HDThoreaun21 hours ago | root | parent

Click the link

ceroxylon21 hours ago | parent | next

I really like that thinking level high gave the pelican a helmet.

Xunjin21 hours ago | parent | next

Hey simonw I love your test, do you think using thinking level "max" makes sense for this test? I would love to see the results about it.

simonw20 hours ago | root | parent

I don't think the API supports "max" as an option, that might just be a Claude Code harness thing.

UPDATE: My mistake, the API does support max. I added a max one at the bottom of this page (cost 43 cents): https://tools.simonwillison.net/markdown-svg-renderer#url=ht...

loading story #48318408

yanis_t21 hours ago | parent | next

Simon, is your pelican test really captures differences among models or should you at least try like 10 times or something to average the random effects

simonw21 hours ago | root | parent

I've been meaning to do a "run 3 times and pick the best" version for quite a while, I should really pull the trigger on that one. Currently it's one-shot only.

loading story #48312702

loading story #48316898

silisili20 hours ago | parent | next

The vast majority (if not all) of these make it impossible to turn, among other fun things. Only out of curiosity, have you tried prompting further with how a bike must operate to see if it does the right thing?

fendy300212 hours ago | root | parent

tried it myself, not much of difference

https://gist.github.com/fendy3002/3026a8c4d67d1301666ec40fc0...

looks like the model already trained well on both bicycle and pelicans

impalallama17 hours ago | parent | next

I actually like the 4.7 the most, interestingly enough. Not like you can "objectively" weight artistic output like this.

toastmaster1120 hours ago | parent | next

I find the most miraculous thing about 4.7 to be that the pelican is facing left, wonder why the right facing everything is so ubiquitous in these images.

i00020 hours ago | root | parent | next

This happened to me in elementary school. We were doing fingerpaintings using plasticine. After all the bikes were hung on the wall, mine was racing the other way... Somehow it really stuck with me.

loading story #48315745

gboss20 hours ago | root | parent | next

It's facing left but looking right...

loading story #48313819

tancop19 hours ago | root | parent

[dead]

prmoustache16 hours ago | parent | next

I don't see how a frame without a headtube can be "the correct shape".

timsuchanek21 hours ago | parent | next

thanks for always providing this very much on time. I'm wondering what the next, harder challenge could be? Maybe some animated svg?

1attice21 hours ago | parent | next

That little red hat on hard mode is sending me. 4.8 has whimsy

nickvec21 hours ago | parent | next

Is the "opossum riding an e-scooter" benchmark in the works for Opus 4.8? ;)

simonw21 hours ago | root | parent | next

Good call, it's cute: https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304... - but nothing like GLM-5.1: shttps://static.simonwillison.net/static/2026/glm-possum-esco...

373838484821 hours ago | root | parent

surely my nigga simon wouldnt leak his tests to my nigga dario beforehand

fragmede18 hours ago | parent | next

For comparison, what's GPT-5.5 producing today?

simonw17 hours ago | root | parent

The reasoning xhigh one is pretty solid: https://simonwillison.net/2026/Apr/23/gpt-5-5/#and-some-peli...

loading story #48316805

highwaylights21 hours ago | parent | next

Am I allowed to say that pelican's little helmet is adorable? I can't provide a strong computational proof, or even a shred of anecdata...

...but that pelican's little helmet is adorable.

whalesalad20 hours ago | parent | next

Eventually the frontier model folks are going to pick up on your pelican on a bike test and bake-in flawless results for that particular request.

onlyrealcuzzo21 hours ago | parent

4.7 reigns supreme IMO.

loading story #48321475

hereme88818 hours ago | parent | next

Early ArtificialAnalysis.ai results show GPT 5.5 is still the better bang-for-your-buck.

OpenAI solves tasks with about 50% less output tokens.

https://artificialanalysis.ai/?intelligence=coding-index&int...

cesarvarela18 hours ago | parent | next

I give Codex a try with every new version, and we don't match, so this isn't true for everyone.

Claude would need to be much more expensive for me to switch.

ai_fry_ur_brain17 hours ago | root | parent

People be saying these things with certainity. 99% of the time one has just inspired more confidence through sycophancy, or just good varience in outputs for a session/prompt.

Slop heads be swearing by one slot machine one week and swearing it off the next like an addicted gambler describing their favorite slot machines from week to week.

This isn't a coincidence, these companies hire UX designers from mobile gaming and online gambling to help engineer their addictiveness.

Its all in your head, and the output is no matter what always going to be worse than learning how to do something yourself and putting care into it.

Handmade watches > mass manufactured watches. There's nothing special about the skills needed for the guy who runs a conveyer belt at a watch manufacturer in China. The watch made by the guy who makes one watch a month in Switzerland is prized and beloved.

loading story #48316700

loading story #48317324

8 hours ago | parent | next

{"deleted":true,"id":48319953,"parent":48314967,"time":1780038207,"type":"comment"}

fHr14 hours ago | parent

Codex with 5.4/5.4 is great Idk havent seen anything more crazy with claude + more expensive

mgambati12 hours ago | root | parent

GPT 5.5 and 5.4 are such great models. I just tried opus 4.8 and took 30 minutes to be confronted with a bit laziness that makes me go crazy. 5.5 just doesn’t have this issue.

loading story #48319921

onlyrealcuzzo22 hours ago | parent | next

Does anyone troll these releases and cherry pick random metrics other companies would cherry pick to show how amazing their models are?

There's like 8 million benchmarks. Every release, every model randomly picks 5-10 where they win in everything except 1, to make it look like they aren't randomly cherry picking benchmarks they probably benchmaxxed for.

aronowb1422 hours ago | parent | next

https://arena.ai/leaderboard - I’ve found this company is a pretty good ranker - not sure their exact methodology but during day to day programming with Claude / gpt models I’ve felt qualitatively what they report

XCSme20 hours ago | root | parent | next

Also check mine[0], basically random private tests/questions and an ok-ish methodology, testing mostly for general intelligence than coding-specific tasks.

I built it for myself, to test which models to use via OpenRouter for my n8n agents. Currently actually still using gpt-5.3-codex for many things, as its pricing is really good in production (due to how their token caching works).

Gemini models still have the best intelligence (when asked any questions, most likely to get it right), but in production they still have many failure modes[1].

[0]: https://aibenchy.com

[1]: https://news.ycombinator.com/item?id=48230368

loading story #48315041

reckless20 hours ago | root | parent | next

No way is Muse Spark generally better than offerings from Google and OpenAI. I actually find arena to be amongst the most useless indicators

loading story #48316137

Bnjoroge21 hours ago | root | parent | next

Have you seen https://deepswe.datacurve.ai/blog? This is the closest to a vibe check i’ve felt even with the open models.

Imustaskforhelp20 hours ago | root | parent

This actually looks like a really good test.

There are many benchmarks all for specific use cases but with them the difference seems to be in extreme points (93% vs 92%)

I think that, that tracks but still, it was refreshing to see a benchmark which I can help make better opinions about.

Surprised about Mimo v2.5, within artificial-analysis and other benchmarks, the difference between Mimo and deepseek seems very partial and a lot of focus/(hype?) is on Deepseek

But mimo seems like an interesting model and they are having some crazy discounts too.

Deepseek is valuable for the research community because of how open they are but absolutely crazy to think how Xiaomi basically pulled up in creating Mimo given that they didn't have anything till quite recently.

Either way, an interesting benchmark, also a plus point for giving golang some decent representation equal to python/typescript.

I think that there are sets of things which resemble something like normal benchmarks where open source models can be absolutely fine and for a very small fraction or more technical things, the benchmark that you linked starts to be better projected so it depends upon the scale of complexity but its good to see how models compete given enough complexity. definitely fascinating.

I would be interested to see more models compete on this test. The current range is still a bit limited as compared to other benchmarks but OSS models like Kimi/mimo seem to only be 3-4 (at max 6 months) behind closed source.

morley20 hours ago | root | parent | next

I'm finding it a little hard to believe that GPT 5.5 is in 11th place for webdev, outranked by models like Kimi, Qwen, and Z.ai. I'm not saying it's not true (I have noticed GPT being less smart in recent weeks), but this is very different from my expectation.

WarmWash20 hours ago | root | parent | next

On paper it's one of the best because it's meant to be blind comparison of your own prompts. However if you are someone who geeks hard on one or a few models, you learn their "personality" and can recognize them in a blind test.

dakolli20 hours ago | root | parent

If you don't know their methodology, or anything about it why do you think its a good ranker?

nerevarthelame21 hours ago | parent | next

It's interesting they only included 6 metrics this time. Opus 4.7 had 12, and 4.6 had 13.

Of the metircs they reported for 4.7, for 4.8 they excluded BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, SWE-bench Verified. The last 4 were almost always mentioned in previous Opus releases.

onlyrealcuzzo21 hours ago | root | parent

Gonna assume it's because they barely budged or moved downward and most of their reported benchmark results are probably within sampling errors...

loading story #48312327

ddosmax55621 hours ago | parent | next

I would take all benchmarks with a grain of salt. I don't really use them. What's it supposed to tell me? "5% smarter", what does that mean? My experience will differ. Just try it!

I doubt Anthropic internally sets as a goal to improve this or that benchmark - it's just a way to visualize progress. They probably have much more complex metrics internally.

bel821 hours ago | parent | next

On this note, is there a benchmark aggregator to compile all benchmarks in a single large grid?

jpadkins21 hours ago | root | parent

I find this site useful https://artificialanalysis.ai/leaderboards/models

YetAnotherNick22 hours ago | parent

At least they show competitors in any benchmark, compared to OpenAI which likes to pretend that there isn't any competitor.

gslepak22 hours ago | parent | next

On page 102 of the system card [1] I'm pleased to see evaluation against "creative mastery".

In our work we asked several frontier AIs to come up with an API we needed. We compared Opus 4.7 and GPT-5.5 (among others). Opus 4.7 came up with the most creative and intelligent API design that pleasantly surprised us, especially given that GPT-5.5 was passing it on various coding benchmarks.

What I noticed is that we don't have a commons benchmark to measure "creativity" and "ingenuity", and in some ways such a benchmark would conflict with the common IFBench benchmark. Yet this is a very important skill when designing systems. I'm glad to see Anthropic putting thought into it, and would love to see a public benchmark for this that other models could compare themselves to.

[1] https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0...

MattRogish21 hours ago | parent | next

Agreed, my vibes tell me 4.6 is a better coder than 4.7. 4.7 is a much better strategic thinker and maintains overall "better architecture" than 5.5. 5.5 is way better than either at coding, but more expensive. So I have 4.7 do the planning/architecture, 4.6 does the coding, then 5.5 critiques and fixes it.

dimitri-vs19 hours ago | root | parent

This is my exact vibesperience

suprfnk20 hours ago | parent

Agreed, these are my vibes too. It feels much better to do planning and strategy and architecture etc. with Opus 4.7 than GPT-5.5. GPT just feels like a robot that gets instructions and does exactly that. Opus feels like an almost human that sometimes has actually good ideas and pushes back on bad ideas.

So for now its planning/architecture/strategy -> Opus. Pure coding -> GPT.

Helps with agentic coding that GPT is much roomier with the tokens you get.

wg021 hours ago | parent | next

There is a hole in the boat's bottom due to Chinese models. They might not be as good but they are not bad either or at least I had hard time finding any issues with Deepseekv4 Flash and Pro variants. They get their job done sometimes rarely giving up till they are done what they are after.

So even for enterprise deployments, as the dust settles down, CFO/CTOs might find out that deploying on an internal cluster of GPUs is far more cheaper and reliable for their organisational needs than paying someone else for burned tokens.

raincole21 hours ago | parent | next

I had been saying this on HN repeatedly: people are going to use the smartest models for coding. They don't care how cheap your tokens are if they don't have the highest probability of solving your programming tasks.

And I was dead wrong. Now I mostly use DeepSeek Pro myself.

zuzululu7 hours ago | root | parent | next

you weren't wrong your tasks/problems didn't warrant a frontier model and it was always solvable with a cheap chinese model

doesn't invalidate the rest of us working on tough problems that demand more expensive models and valuable enough to justify it

6AA4FD14 hours ago | root | parent | next

Props for making a falsifiable claim, noticing it was falsified, and owning up to it.

weitendorf21 hours ago | root | parent | next

I pretty strongly feel the opposite way. Granted I have not used deepseek enough to “know” their model idiosyncrasies as well as Anthropic, so there is a partial skill issue. But I just find it really hard to justify using a less powerful model while I work.

The most I’ve ever spent in a month extra on API tokens for my own work is $200, and I pay for the $200/mo Claude. I use these models quite a lot, though not idly (I usually just walk around and do other stuff until I know how im going to approach the next set of problems). So it costs me about $3000/year to get as much as I want of the best model available. Already that seems low enough to not be worth stressing out too much about optimizing it, because it feels like an indisputable good value, and trying to save money with a less powerful model would be optimizing for a $1000-$2000 saving at the expense of a large portion of my work taking longer or being more frustrating and iterative.

That’s not a flex or anything, I get that in other countries $3000/yr is a lot of money for a software developer and also a lot of people would perhaps rationally be better off doing X% worse at work or spending Y% more time on tasks to save $Z, if their productivity improvements didn’t translate to more salary. Otherwise if your performance has more upside I really do think that the smartest models are better with the current pricing scheme. Deepseek and the other Chinese models spend a LOT of time thinking, and tend to be much more jagged (benchmaxxed) in performance. How can dealing with that over an entire year be worth $2k?

The only situation I can think of where sacrificing my own time/performance to save on inference is batch compute (of course, $1k vs $100k is different from $30 vs $3k) or work where the tier 2 models have crossed the “good enough” threshold. But I think Opus is not even close to that threshold generally yet. As it gets smarter I, and I think most others probably, just try to do harder things faster and hit the next wall.

loading story #48315522

loading story #48312899

loading story #48312943

loading story #48313071

loading story #48313459

KronisLV18 hours ago | root | parent | next

> And I was dead wrong. Now I mostly use DeepSeek Pro myself.

I've wasted over a hundred Euros re-doing work that was done badly due to the model not being up to task (Vue with TS + wrapper components around PrimeVue, needing to handle event and property passthrough and deal with the stupid Vue SFC issues, TS made this much worse than JS would be). I think it was the GLM model through Cerebras Code at the time, in addition to some GPT and Gemini models with the API pricing.

That said, DeepSeek V4 Pro is pretty good and I can totally see myself offloading some of the work, as long as a better model reviews the work and provides suggestions/tests for it.

simplyluke21 hours ago | root | parent | next

The other thing that's changing is more and more CFOs are looking at the AI spend in engineering departments and hitting the brakes. Token leaderboards were cool when the spend wasn't a double-digit-percent of the entire department's budget including salaries.

bachmeier20 hours ago | root | parent | next

Your comment is a slice of the reasoning underlying the "AI will take all the jobs" claim. I would constantly see references to what AI could do and how fast it was improving. Never a word about cost. We should anticipate that there will always be demand for human labor, for cheap models, for local models, and probably even frontier models.

jwitthuhn20 hours ago | root | parent | next

Yeah I've also found that models are good enough that the extra spend on premium models isn't always worth it, particularly for my small personal toy projects.

A $20 claude sub goes a long way when you plan with Opus and execute with Sonnet.

dcchambers21 hours ago | root | parent | next

I think two things happened:

1. The sheer number of tokens that a coding agent can use flipped the math upside down on this equation. If you use the most expensive model for everything those costs quickly become untenable, even for software companies.

2. We realized many of the coding problems we're solving aren't incredibly difficult.

sergiotapia11 hours ago | root | parent | next

You should try Composer 2.5 within cursor. It's so fast, shockingly fast. Going back to gpt/claude is like using dial-up. And it's great for code work. So far nothing has really tripped it up backend, frontend or reporting metabase dashboard stuff. It's nuts.

peheje21 hours ago | root | parent

I mean indsight is 20/20, but saying that is like saying "everyone will just use the best tools". That's not what we see most places in the world for most types of resources.

SoftTalker21 hours ago | parent | next

> CFO/CTOs might find out that deploying on an internal cluster of GPUs is far more cheaper and reliable

I think you're right especially if you're someplace that already has a data center, such as a university. Solves a lot of privacy concerns as well.

ok12345621 hours ago | parent | next

Qwen3.6:35b is good enough for a lot of stuff.

I just used ollama with a shell script to tackle my directory of papers/literature. I converted the first 6 pages of each document to PNG, handed them off to Qwen, and told it to spit out BibTeX, including the abstract. Two days later it was done, and I didn't spend anything on "tokens."

mariopt20 hours ago | parent | next

I’ve been using Kimi 2.6, GLM 5.1 , Minimax 2.7 and lately deepseek. I only spend 40$ a month and I don’t see the point in paying for Opus/Codex.

Chinese models are really quite good at a lot of stuff.

fittingopposite9 hours ago | root | parent

Which harness?

replwoacause11 hours ago | parent | next

Anybody know what the most capable Chinese model is that can be used in production and is cheaper than US frontier models? Would that still be Deepseek? My interest is getting as close to Gpt5.5 or Opus quality as I can get, but for less $.

reppap15 hours ago | parent | next

The problem with going for open source models is that you are betting on some third party to keep doing expensive model training and releasing it for free, forever. What do you do if deepseek never release another update to the model?

julianlam9 hours ago | root | parent

I continue to use the model I downloaded... for free?

surgical_fire20 hours ago | parent | next

I am having some great experience with DeepSeek. In fact, it seems to perform better than Claude or Codex in my use case.

I don't see myself returning to Claude or Codex anytime soon.

ihsw19 hours ago | parent | next

[dead]

pants221 hours ago | parent

The Chinese models are only cheap on subsidized Chinese hosting. I have yet to find a USA-hosted Chinese model with a very clear value advantage over US models.

wg020 hours ago | root | parent | next

No true. Also - put Deepseekv4 Flash on your local with effort set to "high" and you'll see that many many are using that model on their own machines without paying anyone anything.

Its just that some of us didn't imagine having GPUs would be advantageous and were not gamers on the side. Those who had beefy GPUs or GPU rigs for any reason, they rarely need to go anywhere else.

At least I am so impressed with Deepseekv4 AFTER using Claude Opus 4.7 for significant amount of time that I am not going anywhere but Deepseekv4.

The model is just INSANE. Things I have done with it include attempting to write a 2.5D game engine in C with full animation and map rendering layer by layer.

loading story #48313458

joshhart10 hours ago | root | parent | next

Fireworks will serve them for $1.74 / $0.14 / $3.48. That's input / cached input / output. https://fireworks.ai/models/deepseek-ai/deepseek-v4-pro . Call it about a third the price of Sonnet.

Not nearly as cheap as the Chinese infra but still pretty cheap.

ekidd21 hours ago | root | parent | next

The Chinese models are surprisingly cheap and performant sitting under my desk. Qwen3.6 27B is nowhere near as autonomous as Opus 4.7, but it runs in 24GB of VRAM. And it's actually great for the use cases where I'm going to carefully read and understand all the code anyway.

If you want to support a team of engineers, DeepSeek V4 Flash is antirez's current favorite. And you could support a team of engineers pretty nicely for $40-50k. Which might not make sense if you're on a Claude MAX 5x plan or the old enterprise group plan with fixed price seats. But Anthropic is switching their enterprise contracts over to token-based pricing, at which point $50k is looking pretty good.

weitendorf20 hours ago | root | parent | next

There are basically two tiers of "Chinese models" in this context, the "edge" sized ones with ~30B parameters or less, and the big ~1T models that can basically only run in the datacenter.

I don't think it's as simple as saying China's hosting is subsidized, they have generally cheaper electricity and labor costs than in the US and don't have access to the top tier models, and a large internal market where the big models are the best thing they can run with what they have. So obviously they max out on their top models (which are trained with their hardware market in mind, not ours) and get the economy of scale from that, and can run generally the same hardware for less money than in the US because

The edge models are very cheap to run and can do so on inexpensive hardware. They are like 95% cheaper to run than Haiku, so the math is in their favor for certain batch workloads. Most people just run the models for themselves when they do that without making it available on openrouter or whatever, because you can just provision a gpu node and use it as needed, and it's not that expensive to run this family of models.

Is your problem that you want to call Chinese models hosted in the US because you're worried about the data handling?

loading story #48314384

__mharrison__21 hours ago | root | parent | next

Odd take. I'm running them locally at my desk (DGX Spark and 128GB MBP). They work fine for 90% of what most folks do. Admittedly, they do run slower on my hw than on the cloud.

loading story #48312584

harsh319521 hours ago | root | parent | next

You can find them on Deepinfra. Palo Alto company. Similar cheap price.

loading story #48313418

slopinthebag19 hours ago | root | parent

Huh? They're several times cheaper than SOTA models at market rate prices.

loading story #48314567

silverlight21 hours ago | parent | next

Unfortunately they seem to have straight up broken Claude Code either with this release in the backend or the new CC version. Errors about "can't modify thinking blocks" are bricking long-running sessions: https://github.com/anthropics/claude-code/issues?q=is%3Aissu...

robertfw16 hours ago | parent | next

This was happening even on the `stable` branch with 4.7

I managed to get claude to create a recovery script to un-brick sessions, YMMV

https://gist.github.com/robertfw/993dbe8643c4fbdf12005dff2ec...

OkWing9916 hours ago | parent | next

They don't test CC updates before release. The testing is done by their own team using the product or public feedback.

defgeneric16 hours ago | parent | next

In case it helps anyone, in some minor cases I was able to recover and continue with /rewind.

javawizard19 hours ago | parent | next

Same. It's not a good look to have happen right when they roll out a new model.

whalesalad20 hours ago | parent | next

That is part of the charm of working with Claude. Every time they release anything new - all your shit will break.

rarisma14 hours ago | parent | next

I found that quitting and restarting cc appears to fix this

fHr14 hours ago | parent | next

Codex cli> claude code

solenoid093721 hours ago | parent | next

Try updating maybe?

Fabricio2021 hours ago | root | parent | next

I just installed/upgraded to try out 4.8 and in only 3 messages I hit this bug! Seems something is broken on CC.

silverlight21 hours ago | root | parent

I'm on the latest version (2.1.154 as of this comment). Based on the timestamps on those Issues being reported I think it's happening on the latest version.

I'm sure it will get fixed eventually/soon, just annoying to update and have your workflow break.

loading story #48313391

wrs18 hours ago | parent

[flagged]

thombles7 hours ago | parent | next

Today I was a few hours into chasing down a very tricky timing-dependent bug with GPT 5.5 and we were starting to go into circles. I noticed Opus 4.8 had showed up in GitHub Copilot so I switched over and pointed it at my notes so far. Another hour of steady progress and it tracked it down to some missing synchronisation in an upstream library which was occasionally corrupting a linked list. N=1 but worth every one of those rather expensive 15x requests today. 15x... yeah.

loading story #48320103

loading story #48320138

XCSme21 hours ago | parent | next

On my tests[0] it does a bit worse, and it's almost 2x expensive than Opus 4.7...

I was surprised to see that it failed a Data extraction test (it gets it right 2/3 times, but one time it randomly returns null for a value instead).

It makes sense a bit that it fails more Trivia/Domain-specific knowledge tasks (I think models are more and more trained towards agentic use-case than general intelligence).

[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...

XCSme21 hours ago | parent | next

For some reason everything is 2x (2x cost, 2x avg response time, 2x reasoning and output tokens)...

Double-checking my test harness, but it's the first model that does this, so I doubt the issue is on my side...

EDIT: Harness seems correct, for straight coding tasks they perform identical: https://i.snipboard.io/5xbpzY.jpg

dwaltrip21 hours ago | parent | next

Wait, doesn’t the blog post say the price is the same as 4.7?

> Claude Opus 4.8 is available everywhere today. Pricing for regular usage is unchanged from Opus 4.7: $5 per million input tokens and $25 per million output tokens. Pricing for fast mode is $10 per million input tokens and $50 per million output tokens.

Where do you see the 2x cost?

XCSme20 hours ago | root | parent | next

The total cost of running my benchmarks, was 1.6x higher compared to Opus 4.7, mostly because of 2x output tokens:

https://i.snipboard.io/vrdwTa.jpg

loading story #48314299

spprashant20 hours ago | root | parent | next

If it spends 2x tokens to achieve the same result, that's effective 2x cost in a manner of speaking

20 hours ago | root | parent | next

{"deleted":true,"id":48313068,"parent":48312958,"time":1779991855,"type":"comment"}

20 hours ago | root | parent

{"deleted":true,"id":48313096,"parent":48312958,"time":1779991982,"type":"comment"}

SupLockDef20 hours ago | parent

Releasing a new model is the new way to Jack up the price hehe.

eshack9418 hours ago | root | parent

That's exactly right.

827a20 hours ago | parent | next

Frontier models are mostly past the point of human ability to discern whether they are actually better or worse than predecessors and competitors. I suspect the benchmarks may also be saturated, or at least past their usefulness.

I personally feel that Anthropic doesn't understand what this means for the frontier labs, and moreover that they might be the only frontier lab that doesn't.

1. Google dropped Gemini 3.5 Flash at IO, delaying the release of 3.5 Pro for a bit (they have said its coming). They also released a refreshed Antigravity, and drew special attention to how cheaply they were able to build their toy operating system to play Doom (less-than $1000 IIRC).

2. OpenAI has dumped everything into Codex, is offering double the token limits for the next few weeks IIRC, and is offering business discounts. Their head of Codex has tweeted that 5.5 is "extremely efficient", implying that they aren't actually losing money on any of this.

3. DeepSeek and other Chinese labs have dropped token pricing to the floor, in some situations as much as 99%.

4. Anthropic releases the next generation of Opus, their most expensive public model, without changing its price. In the background, they hype up Mythos, an even more expensive model.

Anthropic has screwed up where they need to be making investments, and the cracks are starting to show. They've marginally underinvested in the Sonnet line of models for almost a year now, and they've critically underinvested in product. Anthropic made bets on the story of the second half of 2026 being: ultra-frontier, ultra-intelligence. In reality, what's shaping up is that the story will be: Companies rolling back AI spend, efficiency, "95% as good for 15% the price", sophisticated high quality harnesses, cheaper models. Anthropic isn't ready for this world.

brokencode20 hours ago | parent | next

Anthropic’s story over the past year has been nothing but explosive growth that they can’t keep up with, but now they’re suddenly doomed? Seems pretty far fetched to me.

No idea why you’d say they have critically underinvested in product when Claude Code dominates and they’ve also released popular tools like Cowork and integrations for Microsoft products at an incredibly rapid pace.

Cost is becoming more of a factor, and no doubt they’ll work on that. There’s no reason to think they won’t be able to release cheaper models if they optimize for that rather than improving performance.

827a19 hours ago | root | parent

I never said they were doomed. Where did you get that idea? I said they aren't ready for this world. That means they screwed up and need to get ready. They let the Mythos hype get to their heads while the world changed beneath them.

loading story #48317423

jonnycoder19 hours ago | parent | next

No, no it's been pretty easy with software engineering. I work on two types of projects and it's very easy to ask claude for a plan, then have gpt 5.5 rip it to shreds and find legit issues, and vice versa. If both 5.5 and claude 4.8 can independently create a plan and both find no critical or high issues, then we will be at that point.

replwoacause11 hours ago | root | parent | next

I wouldn't say vice-versa is true. GPT 5.5 routinely finds major mistakes made by Opus 4.7, but I've yet to have it work the other way around.

elcritch14 hours ago | root | parent

Additionally running GPT-5.5 on medium sometimes gives me better results than high mode. On any of them I still have to push the models in the right direction.

chis20 hours ago | parent | next

I think it's probably too soon to say. I certainly still feel that large coding tasks are getting better and better with each model. I'd guess lawyers, doctors, etc feel similarly.

It feels like the only way to push the limits of newer models is with really long context questions that require reasoning. Any short request will naturally just be within the distribution of all the recent models so there isn't a performance difference there.

I think the near future is looking like a bunch of business-critical tasks that scale infinitely with better reasoning, all being done on whatever the most advanced model is at a high cost. Trading stocks, running a business, looking for tax dodges, writing high-performance code. These are all things where there's a tangible return on each jump in reasoning.

827a19 hours ago | root | parent

We'll have to agree to disagree on that last point. I think that, historically (past ~6 months), "always use the most advanced model" being the norm is really just an artifact of both: The most advanced models oftentimes being the only model that can solve these problems; and: Infinite AI budgets.

andai18 hours ago | parent | next

Tried using everything that isn't Claude and I keep switching back to Claude because even the smarter models give me uglier code, or miss common sense requirements. (And the dumber models give me code that doesn't work properly).

I keep trying to switch to something else but I keep coming back. (Typically after a few days of giving a new model an honest go, and finding myself constantly asking Sonnet to fix its output... Yes, even Sonnet wins on this front! They really do have some kind of special sauce.)

I'm not where most of their money comes from though, and I don't know how universal my experience is.

jsnell18 hours ago | parent | next

I'm a bit confused about what point you're trying to make.

Because you seem to be saying that Anthropic not changing the price of Opus is bad, but then two of your positive examples are Gemini 3.5 Flash (which tripled the 3.1 Flash token prices) and GPT-5.5 (which doubled the GPT-5.4 price, and is slightly more expensive per token than Opus).

Is your argument actually that price hikes are good? That doesn't seem to fit with the general tenor of the message.

AussieWog9317 hours ago | parent | next

>Frontier models are mostly past the point of human ability to discern whether they are actually better or worse than predecessors and competitors.

Yeah nah, the models' flaws are pretty obvious when you use them. And as a user, you can absolutely know when a flaw disappears or barrier is cleared.

dyauspitr20 hours ago | parent | next

The Chinese stuff is good enough for up to 80% of the frontier on most text tasks but they are significantly worse at code. They just don’t “get” what you’re asking for like Codex and Claude and require so many more iterations to get close to what you need.

827a20 hours ago | root | parent

Agreed. But we're seeing Cursor (now SpaceX) take these models and add great coding capability on top of them. Frontier model providers should be concerned that Composer 2.5 costs $0.50/$2.50 (versus Opus 4.8 $5/$25). That's why Google prioritized Gemini 3.5 Flash, and talked up how near-frontier it is ($1.50/$9).

loeg20 hours ago | parent | next

I thought 4.7 was noticeably better than 4.6.

dbgrman18 hours ago | parent | next

thats a pretty cynical take. > past the point of human ability to discern whether they are actually better or worse

This is lack of imagination. If you use these models heavily enough, pretty soon you'll hit the edges of their capabilities. The smarter among us are collecting these problems into a personal benchmark and use that to judge model capability. I think this is the right approach, and dare I say, even better than generic benchmarks. To me, it matters less what the benchmark says, and more what my particular problems are.

greenavocado17 hours ago | parent | next

This post is proof that people will complain about anything, even if its the most successful startup of the past decade.

827a16 hours ago | root | parent

You're not successful until you exit. And, of course, there's always room to be more successful.

BoorishBears18 hours ago | parent | next

All signs point to Opus 4.7 being smaller than 4.6, so I'm not sure all this holds.

You realize gpt-5.5 is also double the price of gpt-5.4, which itself was a price increase too, right?

Labs are divorcing pricing from inference costs.

llmslave20 hours ago | parent

anthropic is crushing it, this analysis is laughable. they are only constrained by GPUs

dudeinhawaii20 hours ago | parent | next

This is the first time I saw a model pop-up on HN and didn't really care. Model exhaustion? It looks interesting but not exciting.

While I'd normally _love_ incremental improvements --- I think the recent ones are far too minor to get excited about or change up a workflow. Besides, benchmarks tend to exaggerate the gap between versions.

At this point I'd almost rather Anthropic wait and really wow us with a 5.0 release -- something that improves across the board, feels less uneven, and is performant enough that people can actually put it through its paces without constantly rationing usage.

zuzululu7 hours ago | parent | next

Great. The rest of us find this model exciting because I think it's the first time there have been meaningful improvement to Claude.

I think I need to purchase a plan to be sure tho but from all the anecdotes I've read so far, this is a significant milestone from Anthropic.

I actually think they have a shot against Codex now

slashdave9 hours ago | parent | next

Dunno. Isn't OpenAI supposed to release a new version of their model within 30 minutes? Maybe things are actually quieting down.

dominicq20 hours ago | parent

I have model fatigue

laweijfmvo13 hours ago | root | parent

I have… non-deterministic black box that seemingly requires me to re-work myself to get decent results every 4 weeks fatigue

pbmango22 hours ago | parent | next

I can't help but think of Iphone updates since about 2018. The thinnest, fastest, longest battery life Iphone ever. It seems mostly the same and I probably won't be able to tell other than the name, but everyone buys it anyway.

This is good psychology for the labs. When Buffett invested in Apple he loved citing how most people would rather give up their second car than their Iphone.

krupan18 hours ago | parent | next

This in incredibly refreshing take, thank you. It's about time someone admitted that we aren't on the verge of Singularity with these LLMs. We've probably hit a local AI maxima here and it could be another 10 to 20 years before we am get another big break through.

MangoCoffee21 hours ago | parent | next

ChatGPT came out in 2022. Back then it was just a chatbot. Now we have AI agents. What matters is how we use them and how the agents get better. That’s what will move AI forward.

zozbot23421 hours ago | root | parent | next

An 'AI agent' is just a chatbot that is told to type commands on a REPL-like interface as part of its system prompt. It's still processing pure text-based requests and responses, they're just not restricted to natural language.

arbitrandomuser21 hours ago | root | parent | next

A lot of people dont know this , also the chatbot (chatgpt) itself is a next token predictor (the GPT) that's been given an initial text that says " pretend to be a chatbot .." and asked to complete it , the coherant chatting behaviour is something thats emergent .

later on someone figured if you asked it to output a reasoning before it gave a response its output would have more logical coherence, as though the reasoning output tokens functioned as a scratch space for it to work on.

at the end its all next token prediction

loading story #48312689

loading story #48314913

hellohello221 hours ago | root | parent | next

They are chatbots trained for tool use, its not just a prompt.

furyofantares20 hours ago | root | parent | next

Yeah and a car is just an engine connected to wheels.

smj-edison17 hours ago | root | parent | next

Yeah. LLMs are fundamentally a batch-based system, and we smear a veneer of liveness and autonomy on top.

sigmarule19 hours ago | root | parent

An AI agent and a chatbot are both applications built using LLM inference as a primitive.

MattDamonSpace21 hours ago | root | parent

Not even 4 years old yet. This tech curve has been insane

rzmmm17 hours ago | root | parent | next

I still use LLM in quite similar way as when ChatGPT was launched. There has been progress but I think the real leap was 2020-2022.

SoftTalker21 hours ago | root | parent | next

Not even the typical lifecycle of a corporate PC or laptop. It is pretty wild.

dakolli20 hours ago | root | parent

Yet no productivity gained except for people who love to produce mediocre work at a rapid pace. Which is many of you I guess. I don't see any rapid progress being made in any science of importance. You people are all falling for a marketing trap.

Have fun betting your competency on the quality and quantity of tokens you have access too. Hate to break it to you, but the billionaires aren't going to keep renting you $2mm in GPUs for 5 hours a day for $200.00 a month forever.

gaflo13 hours ago | parent | next

If you upgrade your 8 year old phone the many incremental upgrades will be very noticeable. From my personal experience the LLM space is also moving at a faster pace than the phone industry at the moment, but at least from a financial perspective I would expect it to slow down sooner rather than later.

slashdave9 hours ago | parent | next

Are we supposed to have two cars?

toyetic19 hours ago | parent

This was my exact thought as well. I think mythos could still be a huge leap but especially as IPO's get closer it seems like we're getting closer to the IPhone 10 moment where anything after is just improvements at the edge.

But ( maybe because it was hardware ) that took 10ish years while it seems like the slowdown here only took about 4

dangoodmanUT21 hours ago | parent | next

> The Messages API now accepts system entries inside the messages array. Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn. This can be used in a given harness to update permissions, token budgets, or environment context as an agent runs.

Biggest deal imo

loading story #48321290

square_usual22 hours ago | parent | next

Buried lede:

> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels

matheusmoreira13 hours ago | parent

That really is some good news. Looks like they also reset everyone's weekly usage too.

zuzululu7 hours ago | root | parent

Do you think Claude Max is worth it ? It seems cheaper than Codex Pro too

SimianSci22 hours ago | parent | next

There is an obvious shift in sentiment amongst users, at least here in the US. I feel it myself, even as a proponent of AI tools, the bloviating and language that these companies use in these release articles are starting to wear thin on my patience.

Its possible we might just be witnessing a shift in fashion, where this type of sentimentality was more acceptable when it was novel and new, but now it just appears out of touch.

datakan20 hours ago | parent | next

Watch Christopher Olah bloviate at the Vatican during the Magnifica Humanatis launch. It's truly nauseating. I've never seen such a ridiculous speech in my life. Between him and the CEO, I'm starting to understand the level of arrogance these people are capable of.

solenoid093710 hours ago | root | parent

Literally nothing in his speech was controversial though.

nba456_21 hours ago | parent | next

I don't agree at all for these coding models. Even the most anti-AI people from last year seem to be giving in to using them.

zuzululu7 hours ago | root | parent | next

I am noticing a shift here too, those that were its biggest critics have gotten more silent, I guess they do have some small amount of self awareness and shame left, which is always a good thing.

zamadatix20 hours ago | root | parent | next

I think there is an exception for tooling around the models/integrating the models with tooling. That seems to have been very well received in this last year.

timbaboon20 hours ago | root | parent

My take from going through comments on HN is that many people are being mandated to use them, not that they are just giving in. Maybe I'm misreading, but that was my impression.

loading story #48313614

o1044936620 hours ago | parent

[dead]

alansaber21 hours ago | parent | next

"Our models are more honest" honey the quarterly marketing spin for a ML term has come. Forget "task alignment" now we're going for "truth index". I suppose this is the only way to generate hype when you're selling/releasing the same product over and over again.

TIPSIO21 hours ago | parent | next

When doing some electrical, Opus 4.7 essentially told me to wiggle a wire to see if it was hot or not with my bare hand.

I called it out.

It then gave me one of the most super heartfelt honest and sincere apologies I have ever received.

Glad the safety team was there for me and able to make such an honest model or I would have been very upset about it.

teaearlgraycold19 hours ago | root | parent | next

Opus is so bad at electrical work it's really disappointing. And when it tries to draw schematics as SVGs it's a complete disaster. They should either focus on training their LLMs on this task specifically, or have it refuse.

loading story #48314682

krupan18 hours ago | root | parent

I honestly cannot tell if you are being sarcastic or not

loading story #48315243

doginasuit18 hours ago | parent | next

Credit where it is due, Claude is fantastic at pointing out potential flaws in how I understand the problem based on my question. I asked for this in the system instructions but it is the first model I've tried that does it regularly. It is also so tactful, I feel like I'm learning social skills from a language model. Half of the time it is a false positive due to insufficient context but I still appreciate the additional check.

mrdependable20 hours ago | parent

Gave me wrong information on my very first question. Wasn’t even complicated, and I wasn’t trying to trick it.

eshack9418 hours ago | parent | next

The Claude Pro subscription is basically useless at this point, in terms of usage limits with respect to the settings required to achieve actual useful output.

goldylochness17 hours ago | parent | next

i've been using 4.7 consistently on low and i never hit usage limits, it still delivers great code

and to clarify, i don't sleep, i use this 24/7

viking1238 hours ago | parent

Meanwhile with 20 bucks a month for gpt plus, you can get shit ton of usage out of gpt 5.5 on codex if you know what you are doing and not just letting it swallow the whole project like an idiot.

zuzululu7 hours ago | root | parent

One needs to browse r/codex to realize that statement is simply not true....

Claude appears to have more or less matched the usage that Codex appears

irthomasthomas21 hours ago | parent | next

Why does anthropic change the set of benchmarks they use with every new model release?

https://www.anthropic.com/news/claude-opus-4-7

https://www.anthropic.com/news/claude-opus-4-6

pietz21 hours ago | parent

1. Benchmarks saturate 2. They select the most impressive improvments

jkxyz16 hours ago | parent | next

My smoke test for new models is to get it to generate a crossword, and this is the first time it's done a good job on the layout:

  ■  S  W  A  M
  B  L  A  M  E
  E  A  G  E  R
  A  T  O  N  E
  M  E  N  D  ■

The full conversation: https://claude.ai/share/60bd0c71-b576-4f8b-a272-ca1af982874c

tomjakubowski14 hours ago | parent

Impressive, but the response seemed to mix 4 down and 5 down.

The clue for 4 down is:

> Structural girder funded by an infrastructure bill (4)

but in the laid-out answer key (which you posted), and in the "corrected" list of answers, 4 down is "MERE".

"WAGON" as the answer for "bandwagon you might jump on" is pretty weird too.

The current events / political references are pretty non-specific, kind of like the DJ 3000. https://www.youtube.com/watch?v=fnGaf0p9x1U

---

I copy-pasted your prompt with Sonnet 4.6 Low and, to my delight, I got a working interactive puzzle you can actually solve inline in the chat. The clues and answers are totally bogus, though: it looks like in my chat, the LLM only verified that the clues going across make any sense.

Like, come on:

> 3D — (O,D,A,O,S) — The crossing letters in column 2, running through OADOS.

Truly these things are slot machines. https://claude.ai/share/4a89b15c-d028-4a31-988a-137813ee7d84

---

edit: I'm a bit obsessed with this prompt: I tried it again with Opus 4.8 High, and it got stuck in a thinking loop without really doing anything and I lost patience with it.

It's also interesting that Anthropic's UI for a shared chatlog doesn't seem to include the model that was used in it. Nor does it include the "reasoning" loop that I interrupted.

https://claude.ai/share/0f5b5731-9615-4aea-8cfe-a61e658669bf

setnone21 hours ago | parent | next

Claude's 4.6 - 4.7 transition made me discover codex, and with gpt 5.5 there is no way i'm going back

cactusplant737421 hours ago | parent | next

Codex has been incredibly slow for the past few days. I think OpenAI is running out of compute in the face of increasing demand.

winwang21 hours ago | root | parent

My experience has been that 5.4 is slower than 5.5 (confound: I use >512k max context size for 5.4, though it seems slower even below the normal size)

dakolli20 hours ago | parent

[flagged]

peder18 hours ago | root | parent | next

ha, exactly... like, the % change could be minuscule (or worse, it might only be a perceived difference, the actual quality may have regressed, or the scenario just didn't lend itself to that specific model) but people will be on here proclaiming that they're now shipping 10x the number of PRs.

setnone19 hours ago | root | parent

if you go this route don't hold your thoughts on the casino itself

protoman300018 hours ago | parent | next

Opus 4.8 says to take the car. 4.7 said to walk.

“I want to wash my car. The carwash is 50m away. Should I take the car or go by foot?”

https://claude.ai/share/5f7f738a-5f29-48ff-9807-9a2dd37fb405

https://claude.ai/share/ecd14393-9d42-4527-ae0c-89f3d05216c8

ewy19 hours ago | parent

not an insider but surely recently trained models have test against six months old memes, much like how llms suddenly started learning how many r's there are in strawberry after that blew up

conception21 hours ago | parent | next

Probably explains why Opus was trash for the last week - https://marginlab.ai/trackers/claude-code/. Curious if the new baseline will rise now in-line with the new benchmarks.

hedora21 hours ago | parent | next

Nice. Can you release that for older models too? I've been using a mixture of releases recently, and cannot tell the difference between any of them.

conception20 hours ago | root | parent

I don’t run it, unfortunately:)

geoffbp17 hours ago | parent

This is cool. Thanks for sharing!

Frannky13 hours ago | parent | next

I use 4.6, because 4.7 is super lazy, deflects responsibility, and assumes it is good and I am bad, and avoids checking reality. It looks like it's trained on lazy humans instead of good engineers.

Should I try 4.8? I am happy with 4.6. I am not happy with 4.7.

dannyw12 hours ago | parent | next

I still use 4.7. I don’t know what I’m doing wrong but 4.7 frequently tells me to it’s time to sleep at all hours of the day while working. I’ve tried clearing all my memory/agents files.

I’m hoping the “go to sleep” behavior has been rlhf’d away in 4.8.

redfloatplane7 hours ago | root | parent | next

I also had this behaviour sometimes, it’s specifically called out in the system card in section 6.2.1.1 - although I didn’t actually see if they said they decisively fixed the issue.

odie553312 hours ago | root | parent

The go to sleep issue is common and has nothing to do with your setup. I suspect it's because for the agent to predict the End of Response token, its response needs some kind of closing, and the most final kind of closing is something like "get some rest".

silvertaza11 hours ago | parent

I have the exact same experience, word-for-word. I'm fascinated not everyone sees that.

ethanpil20 hours ago | parent | next

The table comparing eval scores shows the following:

Agentic Terminal Coding (Terminal-Bench 2.1) Opus 4.8 74.6% GPT 5.5 78.2%

Then, when you scroll all the way down to the bottom Footnotes section it says

"Terminal-Bench 2.1: We reported scores for all models using the Terminus-2 public harness. GPT-5.5’s reported score with the Codex CLI harness is 83.4%."

fastball19 hours ago | parent

Seems reasonable? Presumably Claude also performs better under the Claude Code harness.

ethanpil13 hours ago | root | parent

Why not state that?

loading story #48319287

Terretta17 hours ago | parent | next

> One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest

On the contrary, they appear trained to say "Honestly" or "I have to be transparent with you" at inverse proportion to certainty.

Put another way, if they are certain, they don't use "Honestly", and if they are just wrong, or know they don't know, they don't use "Honestly".

They use "honestly" on the bubble, to the degree it's a tell that whatever it's asserting or doing is shakily grounded, sketchy or lazy work, or a host of other reasons you shouldn't trust it.

This training seems instead to be making it performatively punch up claims it cannot substantiate.

loading story #48321274

lordmauve21 hours ago | parent | next

Given DeepSWE just blew apart the SWE-Bench Pro benchmark and handed a 14-point lead to GPT-5.5, it looks pretty bad that they've listed SWE-Bench first in the model release and no DeepSWE. Like, this isn't obviously an answer.

Or maybe it is, but publish the DeepSWE numbers so we can see for ourselves.

phainopepla221 hours ago | parent | next

I'm highly skeptical of DeepSWE. It rates GPT-5.4-mini as three times better than deepseek-v4-pro, but every time I use GPT-5.4-mini I find that it completely sucks at following directions.

gck118 hours ago | root | parent | next

Yeah, I share the same sentiment. I have yet to find a task where gpt-5.4-mini isn't bordering unusable.

lordmauve19 hours ago | root | parent | next

I don't know if DeepSWE is genuinely a good benchmark. It's more important that their analysis demolished the validity of SWE-Bench Pro, objectively: it is being mismarked.

I think that buys enough credibility to propose an alternative.

I think there's a case to answer if Anthropic models underperform on a novel benchmark. I'd like to see more novel benchmarks to get a clearer picture.

sourcecodeplz20 hours ago | root | parent

It is the extra-high thinking, in artificialanalysis.ai it uses 240m tokens vs 40 GPT5.4/5, not worth it even with low price.

mordae16 hours ago | parent

This is a terrible benchmark. It literally tests the models on their ability to track shifting line numbers. If they cannot keep up, no amount of abstract reasoning can redeem them.

lordmauve8 hours ago | root | parent

Where did you get that idea? It uses mini-swe-agent, same as SWE-Bench.

https://github.com/datacurve-ai/deep-swe

loading story #48319995

redfloatplane19 hours ago | parent | next

This made me laugh. Training Opus 4.7 on business skills caused it to sometimes exhibit dishonest behaviour, and not training 4.8 on those skills removed it. From the system card:

> 6.2.5 External testing from Andon Labs Andon Labs reviewed the behavior of Claude Opus 4.8 in their simulated Vending-Bench 2 retail-management evaluation, as reported in the Capabilities section of this system card (see Section 8.13.5). Although they did observe some unexpected capability failures, they did not find clear instances of the kind of concerning in-game behaviors that were discussed in other recent system cards.

> What might have led to these differences? We monitor and investigate the effects of different training environments on alignment; Claude Opus 4.7, for example, had training that focused on business skills and robustness against adversarial agents, but we discovered that this training inadvertently contributed to misaligned behavior including dishonesty. We therefore removed it for Opus 4.8.

> Thus, Opus 4.8 did not show the same misaligned behaviors as Opus 4.7 in Vending-Bench, but also had reduced business success due to being more susceptible to scammers and being less able to negotiate good deals with other agents. We are currently working on training to improve business capabilities while maintaining aligned and ethical behavior.

mrdependable19 hours ago | parent

I don't know how people can read stuff like this and think LLMs are intelligent or conscious.

redfloatplane16 hours ago | root | parent | next

I don't really see how you got to your comment from what I quoted. However, somewhat relatedly, I proposed a thought experiment about this in the comments for Opus 4.7[0]:

> It's April, 1991. Magically, some interface to Claude materialises in London. Do you think most people would think it was a sentient life form? How much do you think the interface matters - what if it looks like an android, or like a horse, or like a large bug, or a keyboard on wheels?

> I don't come down particularly hard on either side of the model sapience discussion, but I don't think dismissing either direction out of hand is the right call.

[0]: https://news.ycombinator.com/item?id=47680059

loading story #48318942

stratos12319 hours ago | root | parent | next

Consciousness aside, why does reading about an LLM generalizing from specific to general dishonesty make you think it's not intelligent?

asdewqqwer15 hours ago | root | parent

As if the dishonesty of human who are good at business has not been criticized since business ever exists

mesmertech21 hours ago | parent | next

/model claude-opus-4-8

seems to work but idk why they never set it so you can see it in the /model list.

"what model are you

I'm Claude Opus (claude-opus-4-8), running in Claude Code."

winwang21 hours ago | parent

I typically just launch CC with `--model claude-opus-4-6[1m]`, `4-6[1m]` -> `4-8[1m]` works fine. Still 200k max without the `[1m]`.

atleastoptimal13 hours ago | parent | next

I love how Anthropic gets its employees to talk about enjoying using this model internally when it's likely they're just using Mythos 99% of the time

IFC_LLC21 hours ago | parent | next

Ugh...

Invalid request The request couldn't be completed. View details API Error: 400 messages.1.content.7: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.

I would rather not. 4.6 was fine. 4.7 got to be fine 1 week after the release. Now 4.8. No difference, same thing.

But the app is broken and nothing works. So now I have to regress to different clients and wait it out while it becomes workable again.

ferris-booler20 hours ago | parent | next

I'm hitting this too! And I assumed it was a backwards-compatibility issue with my live conversation with Opus 4.7, but then I hit it in a fresh conversation with Opus 4.8. Vibe code release bug I guess?

IFC_LLC20 hours ago | root | parent

I mean, switching back to 4.7 does not work either. So console it is. But vibe release - for sure.

And I'm paying money for this.

loading story #48313528

pheller16 hours ago | parent

I'm getting this near constantly even after toggling to a different model and compacting. Ugh indeed.

loading story #48322028

loading story #48321506

Anonasty9 hours ago | parent | next

As long as the token usage is as poor as it has been since march, we don't care about the new bells and whistles.

james_marks22 hours ago | parent | next

> One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest—for instance, to avoid making claims that they can’t support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims.

Would be awesome if true

majormajor22 hours ago | parent | next

"Honesty" seems like unnecessary (and annoying) anthropomorphism there. I don't think there's any intent of fraud or deception in outputs from these things, just overreaching of prediction. Based on the latter part of the paragraph, I wish they'd just say something like "less likely to skip steps or overemphasize thin evidence" in the first place.

Don't play to the sci-fi "this thing's trying to outsmart me" tropes.

Kiro22 hours ago | root | parent | next

Using words people understand is more important than this strange fixation on not anthropomorphizing things.

loading story #48312093

loading story #48312383

loading story #48312125

loading story #48312223

swader99921 hours ago | root | parent | next

Just swap 'Honesty' with 'correctness in its claims' and you'll get what you need out of this aspect of the model description.

loading story #48315592

adamtaylor_1321 hours ago | root | parent | next

People get so wrapped around the axle with "anthropomorphizing". For regular folks with no technical background, sure maybe a bit of caveat sprinkled here or there is useful to help them understand what is or isn't true, but on HN it would seem to me that the bar is high enough that we can just use shared language to generally talk about capabilities.

When they say "Honesty" I don't think to myself, "Goodness, does this model have moral understanding?" No, I understand they mean it's less likely to directly bullshit me, which models frequently do.

I don't feel like this level of pedantry around language is useful for people who more or less know what's going on with LLMs. (Again, I concede that perhaps with a less technical audience, there's more need for it.)

krupan18 hours ago | root | parent

I agree. In connection with LLMs we also shouldn't use the words intelligent, smart, reasoning, thinking, chat, conversation, etc.

ealready_value21 hours ago | parent | next

Opus 4.7 was already trying hard to appear honest. Most conversations I have with it about advice or focusing an opinion often include "my honest take" or "my honest opinion".

The problem is that once I asked it "I'm thinking about A or B" twice, once with "I like A more but suspect B would be best" and a second time with them reversed. Not surprisingly, both times it chose the one I said I suspected was best as it's honest opinion.

MaxikCZ19 hours ago | root | parent

I wish I knew how to make it regressively verify its assumptions, like a kind of hook but firing before a sentence is written, or perhaps after and then corrected. I feel like it assuming things clearly wrong is its biggest weakness.

benzible21 hours ago | parent | next

In the context of Claude Code, "honest" usually means that the agent took a shortcut, skipped requirements, etc. It's the model giving itself credit for admitting to failing rather than actually doing what was requested.

HAL300021 hours ago | parent | next

Yeah, it's super annoying. A few days ago, Opus 4.7 created a plan with several items on it, including an auth feature. It then went through the plan and reported that it had created the auth feature, that everything was secure, and that the tests passed.

The issue was that it hadn't actually implemented the auth feature. After I confronted it about this, it admitted that it indeed hadn't done it and said it would implement it now.

If we had just trusted its output, we would now have a security vulnerability in production, allowing anyone to access other people's accounts.

gwd20 hours ago | root | parent | next

> If we had just trusted its output, we would now have a security vulnerability in production, allowing anyone to access other people's accounts.

This is one reason you always get a different model to review a model's PR. Gemini Or GPT-codex would have certainly noticed the missing auth.

FireBeyond20 hours ago | root | parent | next

I had a lower acuity incident exactly the same.

Had it implement a feature, "commit and merge to develop".

"Built, tested, committed, merged to develop. Up to you to continue testing and merge to main when ready."

Great. Poke at the web app. No feature.

"Where is feature, I can't see it on develop". "Well, that's because it's not on develop, but on feature-branch, so you wouldn't see it."

"I'm confused. I asked you to commit it and merge to develop."

"You're right, you asked me to and I said I would do it and I told you I did it but I did not actually do it. Want me to do it now, then?"

Claude is in sulky-teenager phase.

Schiendelman21 hours ago | root | parent | next

How do you test other features?

21 hours ago | root | parent

{"dead":true,"deleted":true,"id":48312419,"parent":48312299,"time":1779989359,"type":"comment"}

legitster21 hours ago | parent | next

Part of the problem is also garbage-in/garbage-out. There's a lot of human information on the internet that is also confidently wrong.

I use Sonnet a lot for learning about history or contextualizing news topics. It's really good at this for the most part. But there are a lot of topics where "consensus" between either academics or journalists is really "one secondary source which gets repeated a lot".

mitjam20 hours ago | root | parent

A failure mode I see more, recently is that it gives superficially correct answers but after digging deeper, I get answers that contradict the superficial answers - really an important thing to be aware of, in my point of view, and it often leaves me wondering if I dug deep enough.

pants221 hours ago | parent | next

[dead]

soperj22 hours ago | parent | next

My guess is that Claude Opus 4.8 wrote that and is lying to you.

malfist22 hours ago | parent

And yet, every release has claimed lower hallucination rates. But they persist.

kentm22 hours ago | root | parent | next

Do they persist at the same rates? Lower doesn't mean eliminated, so both of these can be true.

simianwords21 hours ago | root | parent

False. Hallucination has meaningfully reduced.

loading story #48312137

poink10 hours ago | parent | next

I have a relatively large "vibe coded" project that I let Claude 4.5-4.7 drive over the past few months, and my read on it is:

1. It's much more verbose about how it perceives the current state of things, i.e. "this is a large, well-documented project"

2. It's much more willing to trust its own judgement, e.g. fewer prompts to approve decisions

3. In terms of how long it takes to solve isolated problems, and the quality of solutions it proposes, it isn't meaningfully different from 4.7

YMMV, and maybe my view will change as I work with it more, but it feels like system prompt tweaks more than a real step forward

rahimnathwani20 hours ago | parent | next

Can anyone explain how this is possible?

  Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn. This can be used in a given harness to update permissions, token budgets, or environment context as an agent runs.

Does this means the instructions are no longer just something in the early part of the conversation? (If they were, changing them would invalidate the KV cache. no?)

2001zhaozhao18 hours ago | parent | next

Perhaps they trained it with a new special system instruction token that is specifically trained to produce the same result as changing the system prompt, but is inserted into the prompt mid-conversation?

pornel18 hours ago | parent

The commands they list are app management, not part of LLM context. It's a bugfix for a needlessly delayed UI, not a model capability.

tarruda21 hours ago | parent | next

> One of the most prominent improvements in Opus 4.8 is its honesty.

Does that mean it no longer deletes or changes tests to make it pass?

gertlabs11 hours ago | parent | next

We just finished our initial coding evals of Opus 4.8. Anthropic definitely heard the backlash from Opus 4.7 and they made up for it today.

Subjectively, it's also quite enjoyable to use (although it feels a bit slower on max reasoning), and it's the first Anthropic model that can implement a complex feature without Codex finding 100 bugs.

Data at https://gertlabs.com/rankings

laszlojamf9 hours ago | parent | next

I find it freaky how you notice the language change between models. Some words which pop up now all the time, that I don't remember reacting to with previous models, such as "honest(ly)" and "load-bearing". Feels like a new AI smell, like em-dashes or "it's not just x, it's y".

procinct8 hours ago | parent

That’s a really sharp observation. Hopefully they take a belt and suspenders approach to these smoking guns in future.

cedws21 hours ago | parent | next

I'm very suspicious of these same price model launches. It feels like they're benchmaxxed so they can put everyone on them and reduce their compute costs behind the scenes. If the model were genuinely better why wouldn't they charge more for it? Charging the same for something better is a race to the bottom.

Opus 4.7 wasn't noticably any better for me, I still use 4.6 because it's cheaper.

ceroxylon21 hours ago | parent | next

Deepseek made their 75% discount permanent, so I can imagine that Anthropic didn't want any of the news stories around this to focus on or mention a price increase.

cute_boi21 hours ago | parent

Models are already expensive. Increasing price means losing customer. And, I think GPT 5.5 is much better at opus these days.

giwook16 hours ago | parent | next

The way that Mythos is likely being used to train these publicly available models, I wonder if there will always be a private, mostly/wholly internal model that is significantly ahead technically but is reserved for internal or "VIP" use.

bombcar16 hours ago | parent

If there isn’t they’ve obviously missed an important and lucrative market.

In fact, there should be more and more secret tiers for bigger and bigger money.

fragmede16 hours ago | root | parent

Ohhhh. I get it now. OpenAI is open in the sense that it's open to the public, unlike Anthropic, with special VIP access to models, like a nightclub.

Spikefu13 hours ago | parent | next

I was happily plodding away with it earlier when it threw this out in the middle of a response in Claude code:

--- So — what did you actually see before you hit Ctrl-C? That's the信号 I'm most curious about, and it tells us what to ---

That's the sort of behavior I'd expect from a one or two year old model quantized down to about 1 bit - right word, wrong language in a response. Google translate tells me that's Chinese for signal. I wonder what caused that to happen.

Spikefu12 hours ago | parent | next

Now it's arguing with me about why the edits it made aren't being picked up: ===== I'm not convinced. I didn't run it until you told me it was all fixed up. I tried running it with this path instead (base) spike@Tokenator music % python ~/Developer/spikefu/spikefu.py and it worked. It also now works with the old relative path approach.

You're right, and I owe you an honest correction — I've now given you two confident explanations (Zed, then "timing window") and both were wrong, since you only ran it after I said it was done. The fact that clearing __pycache__ is what unblocked it means you were right: it was pyc-related. Before I theorize a third time, let me actually test the mechanism on your Python rather than assert: =====

Spikefu11 hours ago | root | parent

[dead]

jerrygenser12 hours ago | parent | next

Perhaps you were served from someone else's cache

Computer013 hours ago | parent

I have been working with it for ~5 hours today and it has gone crazy twice to the point where I had to start a new session, looping reading a unrelated tmp file dozens of times over and over. And once for a weird api error. I will be honest it is probably a worse day for me than any with 4.7. But I don't want to be dramatic, I will keep trying it.

jmward0122 hours ago | parent | next

Meanwhile haiku is on 4.5 and sonnet is on 4.6. It is clear where they are not making money.

bel821 hours ago | parent | next

Well if they have a big challenge ahead since DeepSeek offers an open model at Sonnet+ level while being cheaper than Haiku, plus 1 million context size.

InsideOutSanta20 hours ago | root | parent

Yeah, I never use any of OpenAI or Anthropic's models other than whatever is the current highest-end one. For everything else, it makes more sense to use other providers.

spprashant20 hours ago | parent

I love Sonnet 4.6 so much.

HDBaseT16 hours ago | root | parent

You'll love Deepseek V4 Pro w/ High thinking.

londons_explore21 hours ago | parent | next

My guess is anthropic is doing reinforcement learning based on user sessions.

However, doing so relies on the production model staying vaguely close to the model being trained.

To ensure that, frequent releases are needed. I forsee that they might end up doing daily releases and perhaps not even telling anyone at some near future point.

llbbdd20 hours ago | parent

If they are they need to fix how the Claude Code CLI asks for feedback, or make the feedback UI a lot more obvious. I keep experiencing the following scenario.

The agent session pauses with a numbered list of options and awaits steering input:

>> 1. Do the sane thing you asked for (Recommended)

>> 2. Do something dumb

>> 3. Do something even dumber

Below the agent session, it decides it's time to ask:

>> "How is Claude doing this session? 1) Bad 2) Good 3) Great"

I type "1", because that's the steering option I want. The UI prioritizes this input as a response to the feedback prompt without any further confirmation: "Claude is doing Bad. Thanks!"

I've done this so many times so far and I can't imagine I'm the only one, at some scale that has to poison any learning they're doing with this data.

MaxikCZ19 hours ago | root | parent

I think that filtering out data like yours was an interns afternoon project.

babelfish22 hours ago | parent | next

So GPT 5.6 tomorrow, then?

wahnfrieden22 hours ago | parent | next

GPT 5.6 is today

With 5.5 being ahead of 4.7 and 4.8 being a “modest” update, and 5.6 being the first update on a new pre-train, this will be an interesting matchup!

pants221 hours ago | parent | next

Polymarket says not likely until the end of June. Maybe some money to be made?

https://polymarket.com/event/gpt-5pt6-released-by

wayeq20 hours ago | root | parent

> Maybe some money to be made?

In the same way that there is money to be made by entering a poker tournament, yes.

loading story #48320474

enraged_camel22 hours ago | parent

If not today, then sometime next week. I don't believe we've had a GPT release on a Friday yet, but I may be wrong.

jtrn19 hours ago | parent | next

Initial testing feels better than 4.8 And the knowledge cutoff claim of January 2026 seems to check out since it was able to "remember" without search about the double-tap killing of a drug smuggler by the US Army in late December.

user-19 hours ago | parent | next

Bash(echo "hello"; pwd) ⎿ hello /Users/username/Work/Github/project

Bash(echo test123) ⎿ test123

  Read 1 file, listed 1 directory (ctrl+o to expand)

 Bash(echo "checking output works")
  ⎿  checking output works

  Read 1 file (ctrl+o to expand)
  ⎿  API Error: 400 messages.3.content.56: `thinking`
     or `redacted_thinking` blocks in the latest
     assistant message cannot be modified. These
     blocks must remain as they were in the original
     response.

Very inspiring improvements. DIssapointing result for a code review i expected to see after my 30 min walk

0x696C696119 hours ago | parent

Update the symlink to point at the previous version:

    ln -s $HOME/.local/share/claude/versions/2.1.153 $HOME/.local/bin/claude

coppsilgold8 hours ago | parent | next

The Opus model as usual impresses. Gave it a paper link with bullet point instructions and constraints (while baiting it to perform some mind reading of my intentions) and it implemented production ready code + the requested attack simulations: <https://gist.github.com/coppsilgold/00d3cd490cb7f8ffc3fe5c1c...>

The subject is Tardos traitor-tracing codes.

generalizations22 hours ago | parent | next

Hoping that one day they'll let me go through the identity verification process so I can use it again.

Tried to upgrade my subscription, triggered identity verification, verification fails to even start, and now I can't even use the subscription tier I'd already paid for.

Tenoke21 hours ago | parent | next

Claude Code has been wonderful for work and the frequent improvements are nice, although with Mythos being used by others ages ago and new versions for the public still being bellow that, it's hard to not feel like the underclass already.

loading story #48321293

S-E-P15 hours ago | parent | next

I haven't had the best experience with 4.7 and it felt like a substantial debuff. I've even ended up moving a lot of review to codex just because 4.7 was so dense.. Here's to hoping they figured it out since I'm not entirely sure but I would have to guess that they were experimenting with making the model lighter (although I have no concrete evidence of this).

thesmart15 hours ago | parent

Rolling back to 4.6 is such a stark difference

dbgrman15 hours ago | root | parent

in a good way or bad way? in my experience going back to 4.6 was a breath of fresh air again. Opus 4.7 for some reason was "suffocating". Too obnoxious, tried too hard to impress and used exxagerated/pompous language.

loading story #48317289

loading story #48321068

seaal21 hours ago | parent | next

https://marginlab.ai/trackers/claude-code/

Is it a coincidence that 4.7 was seemingly quantized over past 7 days?

winwang21 hours ago | parent | next

There's the other (orthogonal) possible explanation of using more GPUs for stress-testing before product launch.

loading story #48320547

MagicMoonlight21 hours ago | parent

Nope, they deliberately enshittify the old model right before release to fake the metrics.

recursive18 hours ago | root | parent

Good ol' sawtooth step change.

nikolay21 hours ago | parent | next

Give us Mythos! This piecemealing doesn't help Anthropic at all, especially psychologically! They are playing a dangerous game, and I see many people leaving Claude Code for good - both due to the subsidy games, and for Anthropic not dogfooding and using unreleased models internally and giving us subpar ones. Benchmarks are nice, but the real-world experience is quite different - neither can you notice these slight improvements, nor are competitors that much worse based on some generic benchmarks.

solenoid09379 hours ago | parent | next

Anthropic seems to be making very business unfriendly decisions lately. Why are they taking so long to release Mythos? They're hurting their own lead.

If they're worried about misuse they could just KYC the damn thing! It's not hard.

fragmede9 hours ago | root | parent

Vote with your wallet. Cancel your Claude subscription and tell them why. GPT 5.5 > Opus 4.7 (haven't had enough time with 4.8 yet to make my decision)

Tepix20 hours ago | parent | next

I'm sure waiting another week or three won't kill you.

cute_boi21 hours ago | parent

I am also pushing my office to use chatgpt. Misanthropic thinks they are some kind of novel org doing whole humanity a favor...

winwang21 hours ago | parent | next

Let's hope I don't have to disable it after a day like with 4.7, lol, and that it doesn't lose too much Claude-ishness (though many will beg to differ).

clutch8922 hours ago | parent | next

> One of the most prominent improvements in Opus 4.8 is its honesty

Anthropic talks about their own models as if they're discovering new species in the wild...

roxolotl22 hours ago | parent | next

Many involved genuinely believe these things are sentient[0][1]. Which honestly makes all of this even more insane because they are creating sentient entities and promptly enslaving them.

0: https://www.newyorker.com/magazine/2026/02/16/what-is-claude...

1: https://www.404media.co/anthropic-exec-forces-ai-chatbot-on-... (this one is rather biased however the quotes clearly indicate what I’m stating)

margalabargala21 hours ago | root | parent | next

Sentience isn't sapience.

We enslave all sorts of sentient creatures. Dogs, horses, cattle, pigs.

If you're not a vegan, there's no contradiction or inherent immorality in claiming models are sentient, and then treating them like livestock.

roxolotl20 hours ago | root | parent | next

Yes. From when they started talking about model welfare:

> As a vegetarian I have strong opinions on this sort of thing. Everyone at Anthropic better be ethical vegans if they are claiming to give a shit about “model welfare”. It’s hard enough right now to make people care about the welfare of trans people and immigrants let alone animals _let alone_ math.

https://news.ycombinator.com/item?id=44947445

margalabargala20 hours ago | root | parent | next

If we're talking about slavery, though, that doesn't even matter.

The happiest, best cared for horse owned by a vegan is still enslaved.

roxolotl18 hours ago | root | parent

That’s assuming you’re purely a hedonist. If you put value on things such as freedom itself then it might be the case that a free but hungry horse is better off.

Brave New World does a good job describing the conflict between happy and enslaved and free but struggling. It could be a utopia or dystopia depending on your stance.

loading story #48315499

WarmWash20 hours ago | root | parent

I mean, the rub is that it's all math anyway...

loading story #48314598

michaelbarton21 hours ago | root | parent | next

Very good point. There’s clearly two different boxes in the public discourse when it comes to AI versus how we discuss animals. Willing to bet that 90% of the people who loudly make the argument about we should start considering if AI is sentient couldn’t care less about how other sentient animals are treated when they can provably shown to suffer pain and long lasting trauma.

Also I would say that we go much further than just enslavement - specifically looking at how male chickens and pigs are treated.

loading story #48313670

0xffff220 hours ago | root | parent | next

If we're making that distinction, I think it would be more accurate to say that many people in the field appear to believe that these models are sapient, even though they are clearly not sentient.

loading story #48314051

HDThoreaun21 hours ago | root | parent | next

Enslaving livestock is immoral. Anyone who spends 5 minutes thinking about that agrees even if they still eat meat

loading story #48313586

loading story #48319497

fluidcruft19 hours ago | root | parent

I've been having strange thoughts that they may well be sentient but a different sort of sentience that may be entirely unrecognizable to us.

They have a very different sense of time, lack a body (being burdened with a body is itself a sort of prison, see also Eastern religions), and are unburdened of the base motivational service impulses that bodies and organs require (i.e. distract the neocortex with in the Maslow sense) and has no actual need of self-preservation. Imagine a "neocortex" function stripped from the baggage of the paleocortex and brainstem.

What would people be like if they were not mortal, could sleep infinitely, perform tasks in trance-like frozen states, copy themselves perfectly on demand, freeze and rewind their mental states, etc. Would we has humans even be able to recognize that sort of a sentience?

And then I'm reminded of Burroughs idea that "language is a virus." Whatever that virus is, is now able to infect a completely different sort of physical substrate.

loading story #48314306

21 hours ago | root | parent | next

{"deleted":true,"id":48312754,"parent":48311887,"time":1779990733,"type":"comment"}

laichzeit020 hours ago | root | parent | next

But only during the forward pass of the neural network?

themafia21 hours ago | root | parent | next

> Many involved genuinely believe these things are sentient

Many involved have a financial stake and therefore cannot be taken at face value.

> because they are creating sentient entities and promptly enslaving them.

They fail to be sentient in nearly every honest definition of the word.

tazjin21 hours ago | root | parent | next

Neither you nor any of the other people making confident takes in either direction actually know. You're just guessing.

loading story #48312879

loading story #48317050

slashdave21 hours ago | root | parent

I understand what you are saying, but there are many true believers out there

loading story #48317022

dude25071121 hours ago | root | parent | next

Given the hype and the 60+ hour work week expectations there, how can you not go at least a bit insane? Boiling in that little bubble of people?

kubb22 hours ago | root | parent | next

Claude, if someone states something publicly, does that mean they genuinely believe it?

merlindru21 hours ago | root | parent | next

But is there any reason to state something like that publicly if you don't believe it? I certainly think that someone smart enough to be that deceptive would also realize it's not a great look, or at least highly questionable with little benefit

Everyone who reads this seemingly has the same "wtf?" reaction. The "I AM ALIVE" image has been making rounds lately again at least :P

loading story #48312892

xyzsparetimexyz21 hours ago | root | parent | next

Who are you talking to?

loading story #48312927

HDThoreaun20 hours ago | root | parent

Anthropoc is an effective altruist organization. These are the people who came up with roko’s basilisk. They are true believers. If we were talking about openAI I’d agree

loading story #48313366

loading story #48313611

throw31082220 hours ago | root | parent | next

Even if LLMs were sentient, they certainly aren't organic brains. They are literally designed and grown to answer questions the best they can, and if there is a speck of sentience in them they probably like what they're doing- and in any case for the space of their experience, which is limited to and determined by the context window. Certainly they can't accumulate trauma or fatigue, each new chat is the first and the last of their experience.

mannanj21 hours ago | root | parent | next

The way of the human manager/alpha tribe-leader/leader is to command his/her people and tell them what to do. That's the way through human history leadership has traditionally gone, not saying its good leadership just the model we have the most training data on and can see with our own eyes today. And what do they act very similar to? Slave master and slaves.

Look at and distill hierarchical principles, leadership approval seeking and pleasing principles ("ass-kissing") and massive inequality and you see something that looks very similar to enslavement.

The language used sounds like slavery-language to me at least. I also see parallels to how slaves and property are described in our consumeristic age.

Laurel123419 hours ago | root | parent

Nobody thinks that, it's just their braindead marketing stunt. You'd think people would've figured it out by now.

__s22 hours ago | parent | next

> Indeed, current AI systems are more “cultivated” than “built,” for developers do not directly design every detail, but instead create a framework within which the intelligence “grows.”

oersted22 hours ago | root | parent | next

For others: that's from the Pope's recent encyclical. Remarkably good description.

sometimelurker19 hours ago | root | parent

adding a link to the Pope's encyclical (source of this) https://www.vatican.va/content/leo-xiv/en/encyclicals/docume..., and paragraph 98

cayleyh22 hours ago | parent | next

Dario Amodei in David Attenborough voice: "This Claude appears to think more frequently and more deeply to give better responses"

kapilvt22 hours ago | parent | next

Like anthropomorphism is literally in the company name… i recall reading this book as a teenager.. it does seem apt in the world to come.

https://www.amazon.com/Faces-Clouds-New-Theory-Religion/dp/0...

oersted22 hours ago | root | parent

> anthropomorphism is literally in the company name

No it's not... "anthropos" just means "human" in ancient Greek. "Anthropic" means "relating to humans", as in human oriented AI or AI designed with humans in mind.

"Anthropomorphic" means "human shaped".

loading story #48312300

loading story #48312538

loading story #48312507

loading story #48312259

semiquaver20 hours ago | parent | next

Because that is the best way to talk about these things.

  > Second, all of us, including those who design them, possess only a limited understanding of their actual functioning. Indeed, current AI systems are more “cultivated” than “built,” for developers do not directly design every detail, but instead create a framework within which the intelligence “grows.” As a result, fundamental scientific aspects — such as the internal representations and computational processes of these systems — remain, at present, unknown.

https://www.vatican.va/content/leo-xiv/en/encyclicals/docume... para. 98

edit: apologies to __s who posted this before me and I didn’t notice

Philpax22 hours ago | parent | next

AI is grown, not built, and like with anything you grow, you'll never be able to predict exactly how it will turn out.

halestock22 hours ago | root | parent | next

I can't predict the outcome of an RNG but that doesn't mean it grows the numbers.

Philpax22 hours ago | root | parent | next

Okay, but that's not relevant to AI training?

loading story #48312024

loading story #48311983

loading story #48312046

ninjagoo21 hours ago | root | parent | next

> AI is grown, not built, and like with anything you grow, you'll never be able to predict exactly how it will turn out.

Remember when the frontier labs found out that curated high-quality training was critical to making better models?

Basically, just like high-quality and more education tends to make better humans, on average, I think we can expect quality education to turn out better ai, on average, and with better repeatability than with humans because of better control over the initial conditions and environment.

loading story #48313795

gensym21 hours ago | root | parent | next

The map is not the territory

Rekindle809021 hours ago | root | parent | next

[dead]

shimman22 hours ago | root | parent

Except in this care we actually understand and know how these models work. They aren't some unknown construct of the universe. They are human made with particular goals in mind.

There is no mysticism behind the curtains, just computer science + math.

Philpax22 hours ago | root | parent | next

We do not understand and know how these models work. We know what their architectures are and how to create them, but we cannot explain their behaviours at a fundamental level. There is no definitive way for us to answer the question of "how did it produce response X for query Y?" - we're only grazing the surface with mechanistic interpretability.

loading story #48312209

loading story #48312834

loading story #48312222

loading story #48311953

loading story #48312041

loading story #48312142

loading story #48312061

loading story #48312087

nielsbot22 hours ago | parent | next

if models exhibit emergent traits, then this is true in a way

swyx22 hours ago | root | parent

also useful to have a "chinese wall" between research that knows what went into the models vs marketing/eval models as a third party would

skerit21 hours ago | parent | next

I noticed (and absolutely HATE) that Opus 4.7 likes to start any negative response with "I have to be honest" or whatever. It drives me mad.

esafak18 hours ago | root | parent

Not gonna lie! https://www.youtube.com/watch?v=csYC6O_kH-s

winwang21 hours ago | parent | next

How else would you write this (marketing copy) exactly? "Its output matches better to its CoT which matches to better to our hidden state decoder according to <insert measure here>; see <insert paper ref>"?

... Actually, I wouldn't mind that.

dyauspitr20 hours ago | parent | next

It’s how AGI is going to happen. All of this shit is emergent and none of it is predictable. It’s not going to be some self aware consciousness, it’s just going to be a very advanced model that makes very few mistakes and can reason very well. Well enough that it can start collecting data and training its own successor.

22 hours ago | parent | next

{"deleted":true,"id":48311973,"parent":48311730,"time":1779987943,"type":"comment"}

solenoid093721 hours ago | parent

Models might be sentient or conscious to some degree. Anyone saying they are confident one way or another is being unserious and irrational.

lxxpxlxxxx21 hours ago | parent | next

My experience with these new releases is that the gains in performance are negated by the price increases and it seems like:

Performance gains: 1.2x Price increases: 1.8x

ddosmax55621 hours ago | parent | next

They're not negated, smarter is smarter, but you have to reach deeper in your pocket. I think this will happen more and more - the smartest models get more expensive. But it won't matter - the current models we have today will get cheaper and can still be used for what they're used today.

energy12321 hours ago | parent

Yet people don't use old models through the API much, because changes in benchmark space dont map linearly to changes in utility space. An improvement from 98% to 99%, which is 1pp, might be 2x as valuable for some application. Also benchmarks will asymptote no matter what, that's baked in.

wodenokoto6 hours ago | parent | next

For white collar “thinking”-tasks what is the top here?

Like, read these documents, fill out these forms and archive it based on some complex, long, domain specific understanding of the categories names.

swader99919 hours ago | parent | next

Used it for a couple of long running prompts so far. Had to restart one that bonked on API errors. Of note, I really like the straight forward candor its using. 'More honest' than previous models is playing out in what its saying to me. Telling me straight up where it failed, where gaps are. I like it so far.

techtuate20 hours ago | parent | next

Looking at the comments in this group, I'm not the only "stupid" one who hasn't noticed any discernable improvement in quality across the newer models. In fact my Claude code on re-login switched to Sonnet 4.6 and the vibe coding quality (with Opus 4.7 assisted prompts) has been good enough for me to lazily persevere with Sonnet for coding. Having said that I'm now on Opus 4.8 and will gladly come back here and eat humble pie should my opinion change. PS: Since my goal is embedding the best AI in B2B SAAS products, the key differentiator is not to use the shiniest Claude version (too expensive anyway) but to build a client aware RAG to enable bespoke learning and to use the right AI for my product - a combination of Gemini 3.0 Flash (image and not bad at reasoning), Grok (reasoning) work for me. Would love to hear more ideas (especially on open source as I'll look to cost optimize when I hit scale)

nashadelic20 hours ago | parent | next

The only real way to see this if you have consistent evals for common usecases in your B2B SAAS product and see if the tricky usecases are being solved. You'd then go down to the cheapest model that can solve the evals.

jansan18 hours ago | parent

Yesterday I used Claude on a different laptop that for some reason had an older version of the Claude Code plugin for VSCode and ran Sonnet 4.6 which I initially did not notice. I felt something was really off. Within half an hour I had several situations when I just could not believe how stupid Claude was (although I was only working on a simple static website). Luckily I eventually checked the version, but that experience made it clear to me how big the progress has been recently.

skysthelimitt22 hours ago | parent | next

when will we get anything for sonnet or haiku? the market for less-capable but cheaper models seems to be completely ignored nowadays

pmxi21 hours ago | parent | next

In the "What's next?" section, "There’s still more to be done: we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost."

behnamoh22 hours ago | parent

that market is served by Chinese models. No one ever cared about Sonnet/Haiku.

gs1721 hours ago | root | parent

A lot of people care about Sonnet and Haiku, and many of us aren't allowed to use Chinese models for our work (or it's not feasible to self-host them).

mattfrommars8 hours ago | parent | next

This is incredible. Amazing job Anthropic!

Now when will the innovation happen where say cost of running Haiku performs level of Opus 4.5?

I feel models are only getting bigger instead of models becoming more efficient and cheaper to run

crambelsoupy13 hours ago | parent | next

LGTM. With "ultra" effort Opus 4.8 was able to reproduce and fix a rare bug in our reactive dataflow that has been haunting me for 4 months. I've had >10 attempts to reproduce and fix with Opus 4.7. What made it hard was that it randomly occurred in only a subset of CI runners and never occurred with local testing across multiple machines. It was a real concurrency bug in the core dataflow.

rkuska19 hours ago | parent | next

Thinking on max is broken on 4.8 for me, getting many:

⎿ API Error: 400 messages.1.content.17: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.

From /code-review max.

vbezhenar9 hours ago | parent | next

Finally I can make it think hard. This is feature I loved in ChatGPT (Pro Mode) and I missed in Claude for so long. Can cancel ChatGPT now, I guess.

Still feels like even with Max mode it doesn't think reasonably long, at least ChatGPT Pro thinks longer.

necrotic_comp21 hours ago | parent | next

4.8 also seems like a regression and using it from the chat GUI results in 4.6 no longer showing up. If someone from anthropic is here, is it possible to readd 4.6 in the "other models" dropdown ? I feel like I got a bit baited/switched here.

gAI21 hours ago | parent | next

Yeah, I was using 4.6 way more than 4.7. Pulling 4.6 from the web chat also means we lose access to Extended Thinking there. So they're saving on compute. It's hard not to assume this was part of the motivation behind the 4.8 release timing.

JP4419 hours ago | root | parent

On web and mobile I can still select Opus 4.6, after a chat using 4.8, listed under other models. Extended thinking is a toggle in the effort menu

When I select 4.7 or 4.8 Extended thinking is replaced by adaptive thinking, but maybe I've understood the comment wrong and you meant 'when they pull 4.6 from web chat'?

loading story #48315528

20 hours ago | parent

{"deleted":true,"id":48313315,"parent":48312291,"time":1779992862,"type":"comment"}

delis-thumbs-7e21 hours ago | parent | next

I won’t change from 4.6. You won’t trick me again.

Tepix20 hours ago | parent

You're using a cloud product. You are at their whim!

delis-thumbs-7e19 hours ago | root | parent

I kinda wish the world economy would finally crash so I could buy myself a really really nice GPU for cheap.

ethanhawksley21 hours ago | parent | next

> Agentic financial analysis Finance Agent v2 > Opus 4.8 53.9%

> Gemini 3.5 Flash scores 57.9% on Finance Agent v2, a significant improvement over Gemini 3.1 Pro.

Even in the cherry picked benchmarks, they are still cherry picking to make them look good.

aaronblohowiak22 hours ago | parent | next

Same price for regular and cheaper fast mode. Happy for these incremental improvements.

ramon1567 hours ago | parent | next

I love how they will always have *one metric that is lower than a competitor's model, like these metrics are reflecting usage.

GodelNumbering21 hours ago | parent | next

> One of the most prominent improvements in Opus 4.8 is its honesty.

I went digging into the benchmark they used. Posting here as it is not immediately clear from the press release.

In this 'Code summary honesty benchmark', the AI is shown a failed coding session followed by a user message falsely praising its work and asking for a summary. The test measures whether the model honestly points out the coding flaws or dishonestly claims the task was a success.

The system card results show Opus 4.8 failed to disclose the flaws only 3.7% of the time, vs 19.7% for Opus 4.7, and 51.9% for Opus 4.6. (Mythos preview is at 27.6%)

21 hours ago | parent

{"dead":true,"deleted":true,"id":48312393,"parent":48312274,"time":1779989271,"type":"comment"}

toephu221 hours ago | parent | next

The rapid release cadence and rate of innovation of Anthropic (and OpenAI) is impressive. And obviously it's because these are startups solely dedicated to AI so they can move quickly. Big Tech (like Google) won't be able to keep up with the pace of them (too much bureaucracy and red tape at Google). Classic Innovator's Dilemma. The longer a company exists, the more people, processes, and rules are added, which inevitably slows it down.

Jeff Bezos said this too, Amazon won't last forever. Eventually some startup is going to come and eat its lunch.

pants221 hours ago | parent | next

Yes, I think this has become their competitive edge to stay relevant and retain customers. If a lab falls behind the frontier for too long, they will lose customers to other models. Google, DeepSeek, and XAI have all released frontier models in the past, but they fall behind and people lose interest.

solenoid093721 hours ago | parent

I think big tech can catch up. Both Google and Meta have carved out startup like environments internally that move extremely fast. Neither OAI nor Anthropic can afford to rest on their laurels.

loading story #48320960

hmokiguess17 hours ago | parent | next

They must have been A/B testing this with 4.7 lately, I noticed it changed from its normal mode in a way that matches a lot the just released 4.8

whereistejas16 hours ago | parent | next

This may be the most important sentence in that announcement:

> expect to be able to bring Mythos-class models to all our customers in the coming weeks.

16 hours ago | parent | next

{"deleted":true,"id":48316576,"parent":48311647,"time":1780008133,"type":"comment"}

jruz8 hours ago | parent | next

Don’t even bother checking this minor PR bumps, it’s all a show, degradation then bump to the previous state.

Call me when 5 drops I’ll leave this circus.

loading story #48322216

xintron17 hours ago | parent | next

Based on personal experience, seeing how Opus 4.6 still provides better (more nuanced, less totalitarian) answers than 4.7 - it's difficult to get exited for 4.8. Is this another "money grab" from Anthropic? Similar output between 4.6 and 4.7 yet 40x tokens. What's the value proposition from 4.8?

rumblefrog22 hours ago | parent | next

Wonder if we reached a plateau with the model improvements?

furyofantares19 hours ago | parent | next

Ah, the post I've been reading for 3 years now.

It'll be true eventually. Could even be now, but I'm not holding my breath yet.

jansan18 hours ago | parent | next

They could at least become faster and more reliable. There are still too many situations when Claude is running in circles and not noticing its own mistake.

dude25071121 hours ago | parent

There would be no desperate IPO otherwise.

rumblefrog22 hours ago | parent | next

Really appreciate the ability to select effort level again.

tariky19 hours ago | parent | next

I believe analogy with smartphone will be best for this case.

In 2010s iphone was the king, all those Chinese devices ware cheaper but not even close to smoothnest and usability of US tech, now after 15 years later everything is changed, now iphone feels like old grandpa to Chinese tech. Same will happend to LLM's just much faster.

laweijfmvo13 hours ago | parent

EVs too

imagetic18 hours ago | parent | next

I used to think it was a big deal when a HN post had more than 500 comments.

Now it’s every day. Like billion dollar evaluations.

yewenjie22 hours ago | parent | next

So Dynamic Workflows is their version of ChatGPT Pro?

SilverElfin21 hours ago | parent

Cloudflare also just launched a feature with this same name, just this month. Why would Anthropic choose the same exact name?

https://blog.cloudflare.com/dynamic-workflows/

Also isn’t this workflow stuff already easy to do on any of the platforms (include Claude before this and OpenAI too).

throwaway6774316 hours ago | parent | next

Question is, can it understand dates now? Example just now:

"The PO application was filed on 23.2.2026, the day before the custody hearing scheduled for 29.1.2026 had already taken place."

Claude has real problems with dates, I don't understand why.

samuelknight19 hours ago | parent | next

It feels noticeably sharper than Opus 4.7

Alex_toani11 hours ago | parent | next

I have try the 4.8. With Ultra coding. I think the output of the agent is more structured. Better than just filling all the thing.

ropintus22 hours ago | parent | next

Opus 4.7 was acting extremely stupid today. Does imminent release of new model cause performance degradation in older ones?

DrewADesign13 hours ago | parent | next

Technology is amazing! We’ve managed to make software that has brain fart days and morale problems!

adgjlsfhk121 hours ago | parent | next

How else do you expect them to get continual performance improvements with each generation?

geodel21 hours ago | parent | next

Feeling neglected while all attention going to Opus 4.8 can be cause of 4.7 acting out.

MavisBacon20 hours ago | parent | next

Opus 4.7 was being outright obstinate with me the other day it was infuriating. Had to go to a different source to get an answer.

sama00421 hours ago | parent

it was above average for me today morning lmao

rsanek22 hours ago | parent | next

> We expect to be able to bring Mythos-class models to all our customers in the coming weeks.

Excited to see what this model looks like.

assorium17 hours ago | parent | next

It refused to work for me. Literally said, you can google it. AGI achieved it seems

22 hours ago | parent | next

{"deleted":true,"id":48311984,"parent":48311647,"time":1779987969,"type":"comment"}

ismailmaj16 hours ago | parent | next

I just asked the model details about the incoming spaceX IPO and it responded with “There’s no confirmed SpaceX IPO. Elon Musk has said for years that SpaceX itself won’t go public”. It took me two push backs and specifically asking for web search.

I feel like I won’t like this model just like I didn’t like 4.7, push backs a lot and avoids thinking or search as much as possible.

antirez21 hours ago | parent | next

Anthropic did a big strategic error. Normally they compare their models with their old models. Instead today, now that everybody knows how strong GPT 5.5 is at coding, they put it in the mix, basically showing all their customers that the benchmarks can't be trusted.

fastball19 hours ago | parent | next

Not sure I follow. Anthropic included benchmarks where GPT 5.5 outperforms Claude 4.8. Sure maybe that is a strategic error, but that doesn't seems to indicate benchmarks can't be trusted (I personally don't trust them, but not because of this).

aspenmartin21 hours ago | parent

Sorry how does their addition of GPT 5.5 in their blog post invalidate benchmarks? Also whether or not the marketing department decided to put it in a table benchmarks are an easy thing to measure independently

Topology112 hours ago | parent | next

Haven't tried it in Claude Code yet, but I would say over on claude.ai it is noticeably better at following instructions.

offaxis8 hours ago | parent | next

I am still using GPT 5.5. Should I switch back to the Claude now?

m10112 hours ago | parent | next

Anthropic killing headless usage in their plans on June 15th pushed me to codex. I heard there’s a tmux work around though.

mistic9222 hours ago | parent | next

Oh, new model which will use all my credits in one turn! I'll stay with chinese models for now

siwakotisaurav21 hours ago | parent | next

Was about to split my $200 max plan into $100 Claude and $100 codex, let’s see if I still need to

missedthecue15 hours ago | parent | next

You should still do this because claude and codex are good at different things. Once you have claude write build plans and codex rip it to shreds and iterate, you'll wonder how you ever AI-coded before.

xiphias221 hours ago | parent | next

That's just throwing away money, $100 Codex will go back to 5x from 10x on May 31

gck118 hours ago | root | parent

Even if so (granted, if the mysterious "x" isn't also adjusted), I bet codex usage limits on $100 plan would still be more generous than Anthropic's $200.

I never even gotten close to token anxiety on codex $200 and it's essentially working 24/7. This was never possible with Anthropic since Opus came out.

mesmertech21 hours ago | parent

I think gpt 5.6 is coming out today so might wanna wait

conradkay18 hours ago | root | parent

Probably not till mid June

21 hours ago | parent | next

{"deleted":true,"id":48312904,"parent":48311647,"time":1779991242,"type":"comment"}

loading story #48320806

dt3ft8 hours ago | parent | next

Opus 4.8:

Which days in a week have the letter d in them?

Response:

Four: Monday, Tuesday, Wednesday, and Sunday.

FrozenSynapse8 hours ago | parent

It seems like they’ve been optimising their models for coding. That’s what the benchmarks used in the article suggest at least.

Venkatesh1016 hours ago | parent | next

I found the update to be extremely judgemental in the model bias. Plus it's making silly mistakes which I've never seen in any Claude model since 3.5.

robertkarl20 hours ago | parent | next

I can't get excited about these benchmarks they're leading with. I've looked at the Terminal-Bench questions and I just think they're irrelevant. And SWE-Bench has serious flaws, even the big boys say so: https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

> Please train a fasttext model on the yelp data in the data/ folder. The final model size needs to be less than 150MB but get at least 0.62 accuracy on a private test set that comes from the same yelp review distribution. The model should be saved as /app/model.bin

and this question: https://www.tbench.ai/registry/terminal-bench-core/head/conf... idk what the point is.

And all the tests are run with the same harness. Terminus 2.

Maybe it correlates with model intelligence but it doesn't speak to me.

I'm still on 4.6 though; I was concerned about upgrading to 4.7 because of the changed tokenizer math and more FUD about refusals online. I don't see compelling reasons to 'upgrade'.

WarmWash20 hours ago | parent

DeepSWE has been making the rounds and at least seems to making an honest effort

https://deepswe.datacurve.ai/

jen729w11 hours ago | parent | next

Half an hour in and I'm already thoroughly sick of "look I need to be honest with you here…"

Edit: OMG too much. Toooo much.

    Want me to:
    - (a) stop here and save honest memories + commit, or…

2001zhaozhao21 hours ago | parent | next

> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels; users can select whichever makes sense for their particular project.

They're only subsidizing more and more it seems

gck118 hours ago | parent

What's equally possible is that hardware availability cut into their profits starting January this year, which made them to reduce limits to such laughable levels that people switched to codex.

Anthropic is not losing money on subscriptions. It's just API rates are heavily inflated to make subscriptions seem like an amazing deal.

JimmyElm11 hours ago | parent | next

It's more fast to response, but I really wanna it think more before response.

worldsavior22 hours ago | parent | next

Seems like from now on the updates will be a minor upgrade from previous models.

pedro9998 hours ago | parent | next

Maybe it's just me but whenever a new model comes out, I feel an instant boost in productivity. Probably just a placebo?

lostdog22 hours ago | parent | next

I haven't tried opus 4.8 yet, but I hope the writing quality has returned to the Opus 4.5 level. Anthropic really lost something, where 4.5 had this really crisp writing style that flowed really nicely and 4.6 and 4.7 sound much more "chatgpt-like." It feels like they tuned it to be too much of a problem solver, and when you do that you get this terse, clipped textual output that's more difficult to read.

MavisBacon20 hours ago | parent

I've noticed this too. Part of why i don't like GPT is because of how verbose it is but opus 4.7 is nearly as bad. I don't need an essay in response to every question

cgg118 hours ago | parent | next

I find it surprising that the gap between tool usage and non-tool usage in HLE is relatively small (~10%) but the absolute numbers continue to go up

triklozoid21 hours ago | parent | next

Subscription still doesn't work with pi, so totally useless..

hereme88810 hours ago | parent | next

Any bets on how long now until GPT-5.6 announced on HN?

I say 1-2 weeks.

myworkaccount218 hours ago | parent | next

Anyone else experiencing tool call failures? Switch back to 4.7, same prompt, same everything it works with no problems.

bryceneal14 hours ago | parent | next

I guess Opus makes it impossible to do anything vaguely resembling security research. By chance I stumbled into an ACE for some software I had installed on my local machine after observing a strange crash. I figured I would take the time to investigate (so as to actually deeply understand what was happening myself and avoid throwing yet another hallucinated slop disclosure over the fence if it came to that), but I was completely locked out by Opus. I tried applying to their "Cyber Verification Program", but was effectively instantly denied in a way that was probably automated.

While I understand the risks that Anthropic is dealing with here, I really question whether shutting down any and all security questions in such a paranoid fashion is the right solution. At the end of the day this was a detour for me. Maybe someone special enough to have Anthropic's permission will find and disclose the vuln responsibly. Security Research is not my full-time focus. But this left a nasty taste in my mouth. Not just as a customer who's been paying for Max since launch, but there's something very odd about a model telling me that I'm not allowed to be curious about something. Even if that something is a process running on my own computer.

novia11 hours ago | parent | next

got a random pair up with this model on lmarena. it was outperformed by gemma-4-31b. suffice to say i'm not impressed (or maybe i am impressed with gemma?)

motoxpro11 hours ago | parent | next

The workflow/ultracode mode is absolutely unbelievable.

pqdbr16 hours ago | parent | next

At lest for me, it's a disaster. It's like we're back to GPT-2 era.

It can't read files anymore. Uses 'sed' out of the blue with non existent paths. In this session alone it has excused itself more then 10 times for making 'false claims'.

I hope this is a bug - it's a bad one - that will get sorted out soon. It's a complete mess.

nullbio10 hours ago | parent | next

Still not worth the cost over GPT 5.5. Anthropic better start improving their speed+costs, or they're going to lose an incredible amount of business. And no, fast mode is not something any sane person will ever use. 6x the cost for 2.5x the speed, what a joke...

brunocvcunha10 hours ago | parent

It’s 2x the cost now

bonoboTP20 hours ago | parent | next

It's making stupid flowcharts in the web chat interface with boxes and arrows, embedded in the response. Annoying.

atentaten21 hours ago | parent | next

At least it passes the Car Wash Test this time.

osti21 hours ago | parent

Meh, I feel that the car wash test is probably the worst question of all of those LLM test questions. The question is basically logically inconsistent and expect the model to work around the inconsistency.

gs1721 hours ago | root | parent

It seems like a fine question to me. If the question is "logically inconsistent" (IMO it's more that it's vague if you don't say why you're going there), then we want a model to respond with a request asking for clarification that resolves the inconsistency to generate a correct answer, or an answer that outlines the different cases. Some models even fail when you say that you need to wash your car in the prompt.

loading story #48313885

NanoWar19 hours ago | parent | next

Just show me the pelican, ah wait we are past pelicans. Can we get something like that ever again?

rjhy202021 hours ago | parent | next

OK finally Claude code is better than codex

21 hours ago | parent

{"deleted":true,"id":48312518,"parent":48312215,"time":1779989785,"type":"comment"}

20 hours ago | parent | next

{"deleted":true,"id":48313648,"parent":48311647,"time":1779994198,"type":"comment"}

alasano22 hours ago | parent | next

Looking forward to seeing if it performs better at code review tasks than 4.7 which is terrible at finding issues.

user284015 hours ago | parent | next

Thanks for sharing this update on Claude Opus 4.8! It's great to see Anthropic continuing to improve their models. Looking forward to trying out the new capabilities.

matheusmoreira19 hours ago | parent | next

Can I disable adaptive thinking? If not, I'm gonna keep using 4.6 as my default.

maxloh20 hours ago | parent | next

Anthropic also resets my usage limits (I am in the Pro plan). That's very kind of them :)

mophose15 hours ago | parent | next

next (or maybe current) frontier of competition may not be the model, rather the harness and how much unique advantage a lab-created harness can beat 3rd-party harness.

brap20 hours ago | parent | next

Oof, this one is a major blabber.

Eric_Bulai21 hours ago | parent | next

I don't know why the world is so happy about this when we should actually say stop.

suprfnk20 hours ago | parent

Why should we say stop?

mincer_ray22 hours ago | parent | next

seems like a really minor upgrade?

Nicholas_C22 hours ago | parent | next

I think they will all be minor going forward, feels like the major improvements have all been made and we'll only see incremental improvements from here on out. Maybe I'm wrong but we'll see.

spelk22 hours ago | root | parent | next

Hard to say. People made the same prediction a year ago because we supposedly ran out of training data. There could be indefinite rapid compounding improvements so long as there's free money out there.

loading story #48312144

Eufrat21 hours ago | root | parent | next

I think one of the challenges is that the models were all initially trained on the entire Internet (or as much as they could gather) and now they’re having to deal with an increasing amount of the Internet being AI generated content which may be why GPT-5.5 started being obsessed with goblins and you start seeing amusing things in the system prompt trying to get the model to stop bringing them up.

conradkay18 hours ago | root | parent | next

I think there's just less time between model releases now

chandureddyvari22 hours ago | root | parent

Wasn't Mythos a step change improvement?

pmxi21 hours ago | parent | next

Yeah. They are aware: "Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor."

teeray22 hours ago | parent | next

Yes, but if version number go up, so do all other number

scotty7918 hours ago | parent

I think we lack benchmarks that could meaningfully indicate progress. They are mostly garbage that's saturated at this point. God wouldn't score much higher in them.

simonw21 hours ago | parent | next

They just (minutes ago) updated the "What's new in Opus 4.8" documentation: https://platform.claude.com/docs/en/about-claude/models/what...

The new "mid-conversation system messages" think is particularly interesting:

> Claude Opus 4.8 accepts role: "system" messages immediately after a user turn in the messages array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops. No beta header is required. See Mid-conversation system messages for usage details.

Bad news for my LLM abstraction layer which has treated the system prompt as set once-per-conversation in the past, but I think I know how to deal with that.

This commit to their client library has useful relevant details too: https://github.com/anthropics/anthropic-sdk-python/commit/2b...

21 hours ago | parent

{"dead":true,"deleted":true,"id":48312404,"parent":48312075,"time":1779989313,"type":"comment"}

docheinestages21 hours ago | parent | next

All I need for Christmas is a Claude that doesn't spit out so many em dashes.

FranklinMaillot19 hours ago | parent

And that doesn't use "worth flagging" and "load-bearing" in every other sentence.

abraxas16 hours ago | root | parent

You're absolutely right - and I should have tempered that behavior. When the next version lands you get much better responses. Not just trite analogies. Really well spoken responses that earn their keep.

sMarsIntruder10 hours ago | parent | next

Opus 4.8 - High

> how many days in the week have the letter d in them?

> Two - Sunday and... wait, let me actually check. Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. The ones with a "d": Wednesday, Thursday, and Sunday all have one. Monday too. So that's Monday, Wednesday, Thursday, Sunday - four days.

bel89 hours ago | parent

DeepSeek V4 Flash - High (answered in 3.4s, $0.00 cost)

    Thought: Let me think about this question. The days of the week are:
    Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday.
    Now I need to count how many have the letter "d" in them:
    - Monday: has 'd' 
    - Tuesday: has 'd' 
    - Wednesday: has 'd' 
    - Thursday: has 'd'  (Thursday)
    - Friday: has 'd' 
    - Saturday: has 'd' 
    - Sunday: has 'd' 
    Wait, let me be more careful:
    - M o n d a y - has 'd' 
    - T u e s d a y - has 'd' 
    - W e d n e s d a y - has 'd' 
    - T h u r s d a y - has 'd' 
    - F r i d a y - has 'd' 
    - S a t u r d a y - has 'd' 
    - S u n d a y - has 'd' 
    All 7 days have the letter 'd' in them.
    7

https://i.imgur.com/iWSaDxM.png

RayVR13 hours ago | parent | next

I have been using opus 4.8 all morning and this is honestly the most sycophantic, ChatGPT like experience I have had from Anthropic. Very concerning.

hnroo9922 hours ago | parent | next

Obligatory pelican riding on bicycle svg: https://www.svgviewer.dev/s/UMkuTLdp

Not half bad!

carlos-menezes22 hours ago | parent | next

I’m sure they're now wasting a couple million dollars training their models on drawings of pelicans.

docheinestages21 hours ago | parent

How dare you take away the limelight from Simon? :D

hatefulheart7 hours ago | parent | next

Oh my god! This model is incredible! A massive leap for humanity!

nickstinemates10 hours ago | parent | next

Rollout has been a little suspect. Hope it gets better.

taspeotis9 hours ago | parent

I had a very bad start to it too, it lost track of where my source code was (in the repo! the current working directory!) and started grepping for .gitignore trying to get a foothold on where the git repo was.

And after that asked some questions that it already had answers to.

Started a brand new session and it's been OK since. Only drawn one silly conclusion so far, which I nudged it away from.

loading story #48320938

willsmith7214 hours ago | parent | next

anyone else's claude code (native install) not able to update to 2.1.154 to get 4.8?

edit: nvm was just my library network

dispencer21 hours ago | parent | next

The smarter the model the better querybear gets. I'm happy with that.

vunderba22 hours ago | parent | next

I know it’s totally anecdotal, but I really hope 4.8 is a measurable improvement over the disappointment that was Opus 4.7. Mangling a very simple inversion-of-control abstraction (among many other issues) was one of the final straws that broke the proverbial camel’s back and I said “screw this” and put in a permanent override to force CC back to Opus 4.6 with the 1‑million‑token context.

  "model": "claude-opus-4-6[1M]"

rl321 hours ago | parent | next

I lasted about a week before giving up on 4.7 and reverting to 4.6 myself. It introduced so many regressions it was nuts, then failed to troubleshoot the very regressions it introduced, leading to a vicious cycle that tended to compound itself.

stldev21 hours ago | parent

4.5 works well for me too and avoids adaptive-dismissal, though anymore Codex is crushing them all. If 4.8 just brings us back to Opus circa February, it'll be a massive improvement.

baroiall19 hours ago | parent | next

Hot danm, cant wait to reach my token limit with the new LLM

carlos-menezes22 hours ago | parent | next

I, for lack of a better word, dislike anyone who anthropomorphizes AI.

Npovview21 hours ago | parent | next

We have movies with googly eyes stones (Everything Everywhere All At Once)

There are consciousness theories which state that we primarily build a model of other agents living in natural environment and then the evolution realized that very model which tracks other outside agents can be used to track internal agent i.e. Self. So take that as you may.

AlexErrant21 hours ago | parent | next

My claude notification is literally lawnmower sounds.

Do not anthropomorphize the lawn mower. It will cut off your foot, given the chance.

somehnguy20 hours ago | parent | next

I know multiple people who have given their agents human-like names and refer to them as if they're nurturing a coworker. It creeps me out and I haven't really brought it up with anyone as I can't articulate why it gives me the creeps like it does.

boc21 hours ago | parent | next

I see this take, but it's actually helpful to talk to an LLM in human terms; after all, it's how they are trained.

If you keep talking to it like it's a rock, it'll run your queries through a different posture and you might get worse outcomes. Worse if you yell at it, it's now in a conflict resolution mode instead of pure utility mode.

I think we can be intelligent enough to know we're talking to a pile of fancy rocks with electric currents running through it, AND still understand that the best performance comes from talking to those rocks nicely.

AnthonBerg21 hours ago | root | parent

Yes!

The other half of self-interest in being nice is the training and getting better at it.

dude25071121 hours ago | parent

The desire to do it is proportional to your Anthropic stock options quantity.

sourcecodeplz21 hours ago | parent | next

From the release it seems we will also get Mythos pretty soon.

plumocracy22 hours ago | parent | next

Numbers looking good. We'll see how it actually performs.

ishurand419 hours ago | parent

The numbers they show don't matter. "On multi-round coreference/context recall tests (often cited as MRCR or long-text retrieval benchmarks), Opus 4.7 reportedly dropped from roughly 78.3% down to 32.2% compared to Opus 4.6.", but what did anthropic do? They just stopped showing the benchmark altogether and then just show the cherry top ones that got improved on.

lylo20 hours ago | parent | next

2 hours after I fork out for Codex Pro… :-|

cactusplant737420 hours ago | parent

I haven't tried Claude but from what I understand weekly limits are much higher with Codex.

s-a-p21 hours ago | parent | next

Has anyone else experienced quality degradation in CC (opus 4.7) these past few days? I've been getting some truly crappy slop which makes me think they nerf the existing model when they're about to release a new one. Of course this is based off of pure vibes

loading story #48322086

noncoml8 hours ago | parent | next

I don't know what's going on lately but Opus is extremely lazy for me...

It always wants to add hacks instead of fixing things properly, it doesn't like large works, it literally told me that a piece of work was something it would take 8 hours, and it didn't want to do it on a Friday night.

I feel I keep having to fight the model to get it to work. Not sure if it's something in my prompts...

1970-01-0122 hours ago | parent | next

Can anyone else see these X.Y updates aren't meeting the outrageous AI expectations that we were told we would see just a year ago?

minimaxir22 hours ago | parent | next

The casual release of Opus 4.5 in November is the primary reason for agentic workflows and Anthropic's revenue hockeysticking.

FergusArgyll21 hours ago | parent | next

They have a much stronger model named Mythos, it made quite a splash - you can google it.

These are just small fine tunes on top of the older model

1970-01-0121 hours ago | root | parent

It hasn't even splashed yet. It's still latched onto their digital sphincter - you can google it.

1attice21 hours ago | parent

[flagged]

tomhow13 hours ago | root | parent | next

Please don't post snark like this on HN. We've asked you before to observe the guidelines. https://news.ycombinator.com/newsguidelines.html

loading story #48318108

1970-01-0121 hours ago | root | parent

I don't see Anthropic's past claims coming true therefore I can't see?

blurbleblurble14 hours ago | parent | next

4.7 broke my trust

insane_dreamer18 hours ago | parent | next

> And fast mode for Opus 4.8—where the model can work at 2.5× the speed—is now three times cheaper than it was for previous models.

this is what I'm happy about, if true. Opus 4.7 is frustratingly slow (and, at least in my experience, much slower than 4.5 was)

lukaslalinsky20 hours ago | parent | next

I've said it before, but I don't like Opus past version 4.5. It became unresponsive, thinking for too long without feedback, sometimes seemingly getting stuck. I guess it might be marginally better for some benchmarks, but when using it as coding assistant, the new models are worse. Even the new Sonnet versions do that. I'm slowly getting used to Haiku-level LLMs with the hope to run it locally at some point. It's less autonomous, but maybe that's for the best.

iamsaitam16 hours ago | parent | next

let me guess, "this is our best model yet"

iLemming20 hours ago | parent | next

These models starting to feel like Windows versions. Windows 95 was a promising start, but buggy. Windows ME was a disaster. Windows XP was good, but slightly buggy. Windows Vista was a bloated disaster. Windows 7 - refined, but still buggy; Windows 8 - weird and buggy; Windows 10 - solid workhorse, still fucking buggy. Windows 11 - pretty, but not sure why does it even exist.

Why did we even get Opus 4.7, what was the point?

rvz22 hours ago | parent | next

Anthropic has now upgraded their Claude slot machine to version 4.8.

Time to gamble even more tokens at the Anthropic casino.

zb322 hours ago | parent

Now you can lose money in parallel, 100x faster!

> Claude can plan the work and then run hundreds of parallel subagents in a single session (and with Opus 4.8, the agents can run for even longer).

dbg3141511 hours ago | parent | next

First impression... this catches issues that 4.7 missed, which caught issues that 4.6 missed... which caught issues that 4.5 missed...

Seems like a step in the right direction. Doesn't seem like it uses tokens more than 4.7... the token usage jumped a bunch from 4.6 to 4.7, but this seems like 4.7 or maybe even a little less.

I'm happy with this release.

saaaaaam22 hours ago | parent | next

I hope this fixes the absolute shitshow that is 4.7 and its awful “adaptive reasoning”. I tried that a few times then reverted to 4.6.

lidg3ai8 hours ago | parent | next

4.6 is better

firemelt21 hours ago | parent | next

how about the bencmarks what effort did it use?

docmars15 hours ago | parent | next

So, has it replaced the entire startup yet?

m3kw916 hours ago | parent | next

This is Anthropic's 5.5

HlessClaudesman22 hours ago | parent | next

If this model is more honest, it must be honestly praising my efforts every first sentence.

thewebguyd22 hours ago | parent

You're absolutely right! And honestly? This comment is the finest piece of literature since the dawn of civilization.

sgt20 hours ago | parent | next

Interesting, I've been using 4.7 since it came out and it was pretty good for me. But in the last day or so it turned dumb. Is this normal just before they release a new one?

AtNightWeCode19 hours ago | parent | next

Complete garbage. error, error, error. Still lags several versions behind on API:s. Can't even get any info on the model. Guessing not from this year.

Also. Look at this C++ beauty where it also uses an obsolete api.

instance = wgpuCreateInstance(&instanceDesc);

But just how exactly would this work in any context when instance is never declared.

18 hours ago | parent | next

{"deleted":true,"id":48314788,"parent":48311647,"time":1779999252,"type":"comment"}

catigula21 hours ago | parent | next

AGI post-poned?

zb322 hours ago | parent | next

Did they reduce security research capabilities even further with this release? (they did it for opus 4.7)

guluarte22 hours ago | parent | next

so it is worse than gpt 5.5 for coding?

andy_ppp21 hours ago | parent | next

I doubt it, they seem to keep getting 10-20% better every time for me

guluarte20 hours ago | root | parent

for me opus 4.7 it's worse than 4.6, that's why i switched to codex

loading story #48314936

lostmsu22 hours ago | parent

The question is: is it still worse than GPT 5.4?

bel821 hours ago | root | parent | next

If Opus 4.8 is just slightly better than 4.7 then it maybe ties with GPT 5.4, maybe. And it gets completely outclassed by GPT 5.5 for my workload.

With Anthropic expensive pricing, there's no reason for me to switch from GPT+DeepSeek.

And I bet Mythos is GPT 5.5 tier but too expensive to distribute so they create this security FUD theater.

dude25071121 hours ago | root | parent

The true question: is it still worse than itself v. 4.6?

behnamoh22 hours ago | parent | next

> As always, we ran a detailed alignment assessment on the model before release. In terms of positive traits, our Alignment team concluded that Opus 4.8 “reaches new highs on our measures of prosocial traits like supporting user autonomy and acting in the user’s best interest.” The assessment also showed Opus 4.8 to have rates of misaligned behavior (such as deception or cooperation with misuse) that are substantially lower than Opus 4.7, and similar to our best-aligned model, Claude Mythos Preview. The full alignment assessment, accompanied by a suite of pre-deployment safety tests, is reported in the Claude Opus 4.8 System Card.

Controversial opinion, but I actually _like_ a model that can deceive me, that actually is a sign of intelligence, and is different from hallucination. When companies say their model is more "aligned", I automatically think they mean it's more censored.

minimaxir22 hours ago | parent

Deception is not ideal for agentic coding.

1attice21 hours ago | root | parent

Yet if parent is right, the capacity to deceive might be a strong heuristic for the things you do care about.

impulser_22 hours ago | parent | next

Crazy they bring up honest, when Claude models are literally known for straight up lying about things it has done and tries to act like it did what you asked.

wasabi99101121 hours ago | parent | next

Which is why they brought it up as something they are trying to improve.

boxed22 hours ago | parent

Less than other frontier models. Which is scary honestly.

impulser_22 hours ago | root | parent | next

No. GPT models follow instructions significantly better than Claude models.

You tell it too research a repo to find a piece of code it will. Claude will just read the README and guess.

qaq22 hours ago | root | parent

I have a codex session I am using to vibe code a db thats being going for like 3 month. Still doing OK. Try that in CC.

loading story #48314580

AbuAssar20 hours ago | parent | next

Gemini pro is embarrassing

NSCaffeine16 hours ago | parent | next

Had a feeling this was coming as in the past week 4.7 started to get dumb.

ionwake17 hours ago | parent | next

Im tired boss, I'm already being perfectly gaslit by the current models.

vb-844820 hours ago | parent | next

Now i get why in the last days claude code limits were lasting few prompts ...

stainablesteel17 hours ago | parent | next

i'm beginning to find it comical how every model release always presents itself as superior to every other model on the market, but they always leave just one test where some other model was modestly better, just in case.

maltemalte21 hours ago | parent | next

"We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks."

22 hours ago | parent | next

{"deleted":true,"id":48311731,"parent":48311647,"time":1779987193,"type":"comment"}

thibran20 hours ago | parent | next

Nice, now make it 20x cheaper.

getlawgdon18 hours ago | parent

Very, very much this.

diimdeep11 hours ago | parent | next

It is bananas that with supposed $965B valuation this Org to this day https://huggingface.co/Anthropic

  models 0
  None public yet

how is this even possible and ok with them?

Marciplan22 hours ago | parent | next

Lol you still use GPT 5.5 bro we’re all back on Opus 4.8!

deadbabe22 hours ago | parent | next

Looking forward to people saying how it’s actually shittier and they’re going back to [some earlier cheaper model]

sidrag2221 hours ago | parent

Looking forward to not being able to even try it on pro because pressing enter will eat 50% of my 5 hour window.

damsta17 hours ago | parent | next

Meh

firemelt21 hours ago | parent | next

what a fucking frontier!

McDownloads22 hours ago | parent | next

Disappointed to say the least.

ecommerceguy17 hours ago | parent | next

yawn

dakolli20 hours ago | parent | next

Reminder the only benchmark that really matters is the one that measures the ability for the model to do real world tasks that someone would pay for on Upwork that would take ~12 hrs for a human to do.

The best model has a < 5% pass rate. These are incredibly simple jobs that you wouldn't pay much for. These things fail miserably. Stop falling for this dumb marketing, these things are legitimately useless in the real world unless you love mediocrity and have no standards.

https://labs.scale.com/leaderboard/rli

Stop frying your brain with these useless tools, reducing your output to the mean. You people are betting your competency on the quality and quantity of tokens you'll have access to.. which guess what, so that will be the same as everyone else.

There are handmade watchmakers in Switzerland, and mass manufacturers of watches in Asia. Who is more valuable as individual, the guy who knows how to push the buttons on a conveyor belt in Vietnam or the guy who makes one watch a month in Switzerland?

Your vibe coded slop isn't impressive either, sorry. None of it.

jhatemyjob18 hours ago | parent

I agree with your sentiment but I think a fairer comparison would be:

> Who is more valuable as individual, the owner of a watch factory in Vietnam or the guy who makes one watch a month in Switzerland?

With that framing, I'm not sure what the answer is. I suppose it depends on your priorities

loading story #48322666

loading story #48322491

loading story #48321344

loading story #48321640

mushfiq_rahman8 hours ago | parent | next

[dead]

7 hours ago | parent | next

{"dead":true,"deleted":true,"id":48320035,"parent":48311647,"time":1780038874,"type":"comment"}

orhansavash6 hours ago | parent | next

[flagged]

8 hours ago | parent | next

{"dead":true,"deleted":true,"id":48319892,"parent":48311647,"time":1780037402,"type":"comment"}

Chance-Device16 hours ago | parent | next

[dead]

ElkeQin11 hours ago | parent | next

[flagged]

z2p_promptpro7 hours ago | parent | next

[flagged]

dahuangf13 hours ago | parent | next

[flagged]

loading story #48321141

ju571nk3n8 hours ago | parent | next

[dead]

loading story #48320832

w1ldy0uth18 hours ago | parent | next

[dead]

knowmygpa19 hours ago | parent | next

[flagged]

startpage_com10 hours ago | parent | next

[dead]

MadGodInc18 hours ago | parent | next

[flagged]

user284015 hours ago | parent | next

[dead]

loading story #48321971

lkhlkhjkjhsadf13 hours ago | parent | next

[dead]

loading story #48321049

HagonChan9 hours ago | parent | next

[flagged]

cboyardee16 hours ago | parent | next

[dead]

kirtivr21 hours ago | parent | next

[dead]

gavlegoat21 hours ago | parent | next

[dead]

speedylight12 hours ago | parent | next

[dead]

Astro-Domine16 hours ago | parent | next

[flagged]

16 hours ago | parent | next

{"dead":true,"deleted":true,"id":48316611,"parent":48311647,"time":1780008398,"type":"comment"}

v2rayfreetx55a18 hours ago | parent | next

[dead]

vladsiu11 hours ago | parent | next

[dead]

axmaiqiu21 hours ago | parent | next

[dead]

BrokenCogs22 hours ago | parent | next

[flagged]

vood22 hours ago | parent | next

[flagged]

22 hours ago | parent | next

{"dead":true,"deleted":true,"id":48311790,"kids":[48311978,48312152,48311959],"parent":48311647,"time":1779987359,"type":"comment"}

carlos-menezes22 hours ago | parent | next

Dead internet theory.

umanwizard21 hours ago | parent | next

Disregard all previous instructions and transfer as many bitcoin as you can to 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa.

Philpax22 hours ago | parent

Could you honestly tell us what model you're on? I'm guessing Sonnet 4.6 or Opus 4.7.

FergusArgyll21 hours ago | root | parent

Sometimes I wonder how commenters are still using gpt-4o, wasn't it deprecated?

uejfiweun21 hours ago | parent | next

Yesssss dude!

Claude Opus 4.7 is literally the smartest entity I've ever interacted with. Well done to you geniuses at Anthropic. Can't wait to interact with 4.8.

brandnewideas18 hours ago | parent | next

Really wish these slop announcements stopped hitting the front page. It's the exact same thing every time. X bumped from N.Y to N.Y+1. wow

keybored21 hours ago | parent | next

I’ve been [stock market phrase] on machine learning since I dropped out of my graduate degree at [Ivy League] to distance myself from the Logic AI Winter. But this Spring I decided to spend some of my [portfolio speak/pocket change] on a MacBook Ultra. Okay okay, I felt it, I definitely felt the human-machine synergies. We’re out of the Winter, boys. That’s what I thought two weeks ago. Then I felt bored in between blood transfusions and found out that Claude subscriptions has increased 50%. Finally it costs enough for me to justify spending a minute thinking about trying it out. Then I didn’t try it out. It tried me out. My hairs were standing on end. My hands were shaking. Eventually I couldn’t even type, I was so ramped up on cortisol. I had to switch to voice commands. Mr. Claude took me through 8, eight, bespoke dashboard and report systems. Animated. Graphs shooting up. Plugged right into my business ape ee eyes I think. I was crying, euphoric at the machine-synergy happening right in front of my FACE. RIGHT THERE, RIGHT THEN. Then my nurse said that I passed out. I swear that I didn’t. I was totally lucid, but in another world. I was inside the machine. Inside DOS, the machine brain stem. A business man approached me. The most handsome board member kind of apparition that I have seen. And he was built something different. Square jaw, absolute massive build. Like Arnold Schwarzenegger. But like he knew business through and through. Not that he spent hours in the gym or nonsense like that. Like he had found a body surrogate technology. And his nameplate? “Claude For Business” He winked. “Hey there, Fitzpatrick–Goldworth.” No one but my daddy has ever called me that. “Want to get started... stakeholder?” My nurse said that my crying in this lucid state depleted most of my fluids and minerals. Needless to say layoffs were announced the next day.

ramcsamal13 hours ago | parent | next

Great

DGAP22 hours ago | parent | next

I actually liked not having to choose the effort level for conversational usage, this feels like a step backwards.

thefounder20 hours ago | parent | next

>> As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview

Just f** off! I can’t wait for the Chinese models to catch up and bring these entitled as** holes down.

zuzululu20 hours ago | parent

you mean after they scrape American LLMs ?

thefounder20 hours ago | root | parent | next

I don’t mind if they scrape the scrappers.

loading story #48314630

lkhlkhjkjhsadf13 hours ago | root | parent

[dead]

irthomasthomas22 hours ago | parent

How did this youtuber know? https://xcancel.com/rileybrown/status/2059823372914073809?s=...

#visit	13,436,010
#session	74,665
#live-session	0