Story Detail of id 48464185 | Liveview Hacker News

frevib8 hours ago | on: Claude Fable 5

At this point Anthropic is a pure marketing and PR company. Super catchy names like Opus, Mythos and Fable trying to get you to think that these software products are actually super-human life changing experiences. Boris Cherny coming to HN “Hi! it’s Boris from the Claude Code team” to get real tech people’s goodwill.

From Opus 4.6 there are no noticeable improvements for me in code generation. It works very well, till 90% completion, if you guide it correctly. And you need a little luck. For serious production code I need to understand what I’m doing so it helps a bit, sometimes.

matheusmoreira8 hours ago | parent | next

> Boris Cherny coming to HN “Hi! it’s Boris from the Claude Code team” to get real tech people’s goodwill.

This is a good thing. I wish every company would do this. I subscribed to Proton Mail after interacting with someone from their team here on HN.

pinkmuffinere8 hours ago | parent | next

> catchy names like Opus, Mythos and Fable trying to get you to think that these software products are actually super-human life changing experiences

This is just good business sense. In what scenario would you ever make the names dumb and forgettable?

> Boris Cherny coming to HN “Hi! it’s Boris from the Claude Code team” to get real tech people’s goodwill.

This is good customer support, lol. From what I can tell, it is indeed Boris Cherny responding, not outsourced to AI or other staff. You're really getting a response from Boris. I suppose that is PR, but it's not unjustified PR, it's accurate.

I'm not even a crazy AI fan, but your criticisms are ridiculous here. It reminds me of the quote from Knives Out -- "Your Honor, she endeared herself to him through hard work and good humor."

IshKebab8 hours ago | root | parent

> In what scenario would you ever make the names dumb and forgettable

Clearly you've never bought a TV or headphones!

aspenmartin8 hours ago | parent | next

Your observations are right, but it’s pretty insane to consider them a pure PR company lol. They are making more frequent releases, so yes, the release-to-release quality gain is smaller, but we’re still ascending quality and reliability curves the same way we have since GPT-3. You get a GPT-4 -> GPT-5 sized leap every 17 or 18 months or so, I think.

kingkongjaffa7 hours ago | root | parent

The gradient of improvement is absolutely not the same.

aspenmartin7 hours ago | root | parent

If anything it's slightly higher. Feel free to provide any evidence to the contrary.

ECI (good aggregate measure using IRT): https://epoch.ai/eci?view=graph&tab=release-date&subset-view...

METR time horizon (now topped out): https://metr.org/time-horizons/

WASDx5 hours ago | root | parent

I like this one, although its data seem to overlap with ECI.

https://artificialanalysis.ai/trends

astrange7 hours ago | parent | next

> Super catchy names like Opus, Mythos and Fable trying to get you to think that these software products are actually super-human life changing experiences.

They're originally named after the blends at a nearby coffee shop.

https://postscript.co/pages/brew-guide

I've noticed nobody at HN knows what "marketing" is or how to do it. It's not just naming things and being evil and cynical is not the most successful method.

…also frontier models are a superhuman life changing experience. If they aren't, what possibly could be?

ValentineC5 hours ago | root | parent | next

Found a tweet from a year ago about this:

https://twitter.com/brian_a_burns/status/1866987688794132816

Well, TIL.

chroma_zone6 hours ago | root | parent | next

My life has changed, but not necessarily for the better.

bitpush7 hours ago | root | parent

This is interesting. Do you have any source?

CuriouslyC8 hours ago | parent | next

I dislike Anthropic but I wouldn't argue 4.8 isn't an improvement on 4.5/4.6. Your tasks just might not typically need the extra intelligence.

jorl17 8 hours ago | root | parent | next

Opus 4.7/4.8 often over-engineers on my setups, plus:

- It talks a LOT more like GPT models. You know: wrinkle, shape, gate, coarse, scope, gap, path, production-ready-workflow-of-the-day, and so on -- "that's expected, a consequence of the previous like-driven workflow". If I wanted to get a headache using AI I would have gone with GPT in the first place!

- It outputs text that is much harder to follow. I can't say exactly what it is. Maybe a bit of everything? Bolds are missing, bullet points are gone, paragraphs are bland and too long, and it doesn't feel like a model programming with me, but rather like a somewhat full-of-himself grandpa developer looking down on me. It's very weird to describe this, but it is definitely how I feel.

Granted this can totally be because of the way it reacts to the prompts now. We've got a rather large corpus of skills and "rules and good practices" that Opus 4.6 responded to great, and maybe the new models just get turned into this when fed with them....I don't know.

Either way, with Opus 4.6 being as good as it is, I need Fable to be a significant step up to justify a price increase. If it can get me to babysit Opus a little bit less on some stuff, it might be worth it. Otherwise, I'm very happy with Opus 4.6 and hope they don't deprecate it.

taormina8 hours ago | root | parent | next

I'd argue that 4.8 is a straight downgrade for every type of task I've tried. It's been a gamble at this point. If 4.6 quits being available, I'm out.

coronapl7 hours ago | root | parent | next

Reading so many contrary positions about which model is better or worse shows how difficult it is to measure intelligence based on personal experiences. Of course, benchmarks try to make the process as objective as possible, but they often don't correlate with our personal experiences.

The other day 4.6 was fantastic for x task. Today, 4.6 overengineered everything and I had to revert all my changes. When evaluating models, perhaps it makes sense to consider luck as an ingredient before reaching any personal conclusion.

surgical_fire8 hours ago | root | parent | next

I actually experience 4.8 as worse than 4.6 for everyday coding tasks.

dcchambers8 hours ago | root | parent | next

IME Opus 4.8 (and 4.7) is often a downgrade from 4.6. I find that it tends to overthink and overcomplicate things.

aspenmartin8 hours ago | root | parent | next

Yes, but there’s a reason we don’t evaluate these models this way and instead do it as carefully and thoughtfully as we can at scale. Human evaluations are important, but they are an absolute minefield of footguns. 4.8 is not a downgrade from 4.6; there is an insane amount of hard data that contradicts this.

computerex8 hours ago | root | parent | next

The flip side is that benchmarks are gamed even by the top labs. Benchmark performance doesn't necessarily correlate with real world performance.

aspenmartin8 hours ago | root | parent

Again correct, but it overstates the issue. I can say labs don’t want this. This happened, arguably unintentionally, in Meta's Llama 4 release: it went horribly, heads rolled, several billion dollars were paid for new talent, and the org that built Llama 4 was destroyed.

Evals come from a million places and new evals and robust perturbations of existing evals abound. They test a variety of tasks in a variety of ways. All of them individually are flawed. Taken together the aggregate signal is highly useful as you more or less marginalize over a lot of different things. Not to mention these companies have plenty of proprietary internal measurements, they build benchmarks themselves to probe their models and then also have flywheel traffic and A/B tests.

You are right to call out benchmarks but to dismiss them or not take them seriously is a mistake.
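The "individually flawed, useful in aggregate" argument can be illustrated with a toy simulation. This is only a sketch under a strong assumption: each benchmark's error is independent noise around the same true gap between two models. The function name and all parameter values are made up for illustration. Averaging does nothing against a bias shared by every benchmark (for example, contamination or training to the test), which is the objection raised elsewhere in this thread.

```python
import random
import statistics


def fraction_correct(n_benchmarks, n_trials=2000, true_gap=1.0, noise_sd=3.0, seed=0):
    """Estimate how often the mean of n noisy benchmark deltas has the right sign.

    Hypothetical setup: model B is truly better than model A by `true_gap`
    points, and each benchmark reports that gap plus independent Gaussian
    noise with standard deviation `noise_sd`.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        deltas = [true_gap + rng.gauss(0, noise_sd) for _ in range(n_benchmarks)]
        # The aggregate "verdict" is simply the sign of the average delta.
        if statistics.fmean(deltas) > 0:
            correct += 1
    return correct / n_trials


if __name__ == "__main__":
    for n in (1, 5, 25):
        print(n, "benchmarks ->", fraction_correct(n))
```

With these made-up numbers, a single noisy benchmark points the right way only somewhat more often than a coin flip, while the average of 25 is right far more often. The improvement comes entirely from the independence assumption.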

taormina7 hours ago | root | parent

Listen, you can say “but benchmarks, the benchmarks!” all day long, but consumers know when we are being sold a lemon. If it can’t do the most basic of things at least as well as it used to, this is table stakes. Never mind that if you can’t do the basic stuff, how on earth can you be trusted with more?

aspenmartin7 hours ago | root | parent

And you can say “If it can’t do the most basic of things at least as good as it used to, this is table stakes” all day long while people point you to much better evidence to the contrary too, I’d rather be on the other side of that.

taormina6 hours ago | root | parent

Listen. I don’t care about evidence. I care about my lived experience of the product I paid for. I used the new product. It’s actively terrible. To the point of not being usable. We’re all anecdata, but what is “better evidence to the contrary”? The known and game-able benchmarks that they know they need to win at, so they train it to win them. It’s all he said, she said, which is the only reason we keep having this conversation.

aspenmartin6 hours ago | root | parent

Yea but it’s not right? You or I or the myriad of other institutions inside and outside of academia can probe these models with an evolving landscape of evaluation sets, even those unavailable to the developers. It’s just ignorance to claim benchmarks are somehow useless or all being gamed. You choose your tools in the way you want, but just don’t call it somehow better than a myriad of more carefully constructed setups and scaled evaluations.

gen220 8 hours ago | root | parent | next

Actually anecdata I gather on my job from myself and coworkers is the only benchmark I trust anymore, because it so heavily diverges from the “benchmarks”.

aspenmartin8 hours ago | root | parent

That’s your call just don’t expect anyone ever to take that seriously. It’s not like we don’t have exact evaluations like this.

gen220 5 hours ago | root | parent

I would encourage you to look into the open evals of some of these benchmarks (find one that actually is open-data, this is itself a good challenge), read the results generated and assess them for yourself.

This is what myself and my coworkers (and many other people in this thread) are doing on a daily basis with real stakes and real tasks – which these benchmarks are all aiming to be a proxy for. There's a real, tangible [cost]benefit to [not] using the highest-ROI models and harnesses.

The people with real incentives and skin in the game are telling you that the data diverges from "the data".

I don't mind if you don't take it seriously, our jobs are more important to us than a benchmark is.

But I wouldn't opt-out of using your own eyes and the eyes of others so easily, especially when there are literally hundreds of billions of dollars in invested capital with an interest in a certain outcome... this is how you end up in "Emperor's New Clothes" situations.

aspenmartin5 hours ago | root | parent

Investigating on your specific use cases, codebases, workflows and tasks is important, there is nothing wrong with this and in fact it’s more important than benchmarks if you can do it well but the point is that is very hard and easy to totally fool yourself and go down a suboptimal path. I understand that people are going to do it regardless, I certainly do. And I have looked at more raw benchmark data than I can really even stomach, I can see annotation data in my dreams now.

The eyes and ears of others are incredibly important. But you still seem to think benchmarks are somehow part of some giant conspiratorial cabal. You have institutions without ANY skin in the game making extremely high quality benchmarks. Consider that in academia there is little else to do outside of partnerships with these companies. But benchmarks you can do completely independently and with university grant level money (it costs maybe $10-100k for a reasonable benchmark in many cases). Not only that, “real tasks” are what many benchmarks measure. You have these companies with extremely good logging and well scaled measurements to really look at what works and what doesn’t.

gen220 2 hours ago | root | parent

At this point I have a workflow that is fairly rote. I've yet to use a model newer than 4.6-1M-XHIGH that I trust to earn a higher ROI on that workflow, and not for lack of trying!

I personally don't believe in any sort of cabal (Occam's Razor hasn't let me down yet). Ultimately, I don't really care *why* they're wrong as much as I care *that* they have diverged from my rubber-meets-the-road measures of value.

That is concerning to me, because people are investing 100s of B's of capital based on the putative RoI putatively available to people like ourselves. When the benchmarks support this RoI thesis, but none of the anecdata does... that's really concerning!

Re: academics, I don't think any of the data academics have access to are good proxies for the work real people are doing. And for the data that are good proxies, the model labs certainly have access to the same data, and therefore the benchmark performance against those data is irrelevant.

recitedropper7 hours ago | root | parent | next

"Carefully and thoughtfully" is antithetical to the approach to benchmarks these days.

Maybe back when this was a scientific endeavor; not now when enormous, enormous amounts of capital are on the line. Along with an entire cult's chosen eschatology.

aspenmartin7 hours ago | root | parent

You can call it a cult but it’s several thousand skilled workers who know what they’re doing, by and large, most of whom have a PhD and know how science and statistics work. Benchmarks are incredibly hard, and any PR or comms department at any company is going to obviously want to make things as rosy as possible, but beneath this are earnest, expensive efforts to get good quality measurements. The better you can do this the better you can compete. If you want to make a modeling decision you run an ablation, and the quality of that decision is only as good as your measurements.

recitedropper6 hours ago | root | parent

The cult in this case is TESCREAL, not everyone working on AI. Last I checked not all the "several thousand skilled workers" in AI subscribe to TESCREAL ideology, although it has been a while since I've been to the Bay. Maybe things have changed since my time at Berkeley, and Dario's belief that he will eventually be made immortal by mind uploading is more widespread.

Otherwise we agree that benchmarking is hard, the benchmarks contain hard problems, and that there are many hard working people trying to accurately gauge what is going on. It is getting harder to watch though as all that is on the line taints the overall endeavor.

pythonaut_16 6 hours ago | root | parent | next

Seems like a bunch of noise. What does this even mean?

It sounds like you're saying "Actually you, as a human, are simply not smart enough to evaluate Opus 4.8"

aspenmartin6 hours ago | root | parent

No, it’s: evaluating these systems is complex, and there’s a reason why sociology, cognitive psychology, medicine, etc. are all done in careful double-blind conditions with preregistered tests. It’s not that humans are not smart enough; as I said, human evaluations are incredibly important. And yet they are a minefield of biases you have to worry about and correct for.

- evaluations need to be done at the same time to avoid drift in your bias

- you need to worry about your test set: which questions are you asking? How many of them? Are they representative of your work?

- which one did you do first? Raters have a tendency to bias in one direction or another

- you also know the label! You know which model is which! This biases your assessment…

And on and on and on. Careful science exists for a reason.
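A minimal sketch of what a blinded, order-randomized side-by-side comparison could look like, addressing the label and ordering biases listed above. Everything here is an illustrative assumption rather than anyone's actual harness: the function names, the "A"/"B" labels, and the choice of an exact two-sided sign test on the non-tied trials.

```python
import random
from math import comb


def blind_pairs(prompts, outputs_a, outputs_b, seed=0):
    """Build trials where the rater sees 'left'/'right' but not which model is which."""
    rng = random.Random(seed)
    trials = []
    for prompt, a, b in zip(prompts, outputs_a, outputs_b):
        # Randomize which model appears on the left to cancel position bias.
        if rng.random() < 0.5:
            trials.append({"prompt": prompt, "left": a, "right": b, "left_is": "A"})
        else:
            trials.append({"prompt": prompt, "left": b, "right": a, "left_is": "B"})
    # Shuffle trial order so both models are rated in the same session, interleaved.
    rng.shuffle(trials)
    return trials


def score(trials, choices):
    """Unblind after rating. `choices` holds 'left', 'right' or 'tie' per trial."""
    wins = {"A": 0, "B": 0, "tie": 0}
    for trial, choice in zip(trials, choices):
        if choice == "tie":
            wins["tie"] += 1
        elif choice == "left":
            wins[trial["left_is"]] += 1
        else:
            wins["B" if trial["left_is"] == "A" else "A"] += 1
    return wins


def sign_test_p(wins_a, wins_b):
    """Exact two-sided sign test: could a split this lopsided arise from coin flips?"""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)
```

The rater only ever sees `prompt`, `left` and `right`; the `left_is` key stays hidden until scoring. The prompt set should be fixed before any output is seen, which is the cheap version of preregistration.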

OtomotO6 hours ago | root | parent | next

There is no data that I would trust that contradicts it.

Frankly I don't give a damn about data that could be made up on the spot or appears to be scientific or meaningful while it's not at all clear how it was made (up).

Claude was heavily lobotomised for my work starting somewhen in February.

I talked to friends and people I know and trust and many felt the same. (I didn't ask them whether they felt like I did, but what they felt, how happy they were with agentic coding etc.)

I cancelled my subscription in March and talked to said friends who are still on a plan just last week: they are still not happy, but the company pays so whatever...

aspenmartin6 hours ago | root | parent

That’s ok but at what point is this getting into conspiracy territory? You have just said there is nothing you would believe to the contrary, but then by definition that’s not exactly a very thoughtful or insightful position.

orbifold7 hours ago | root | parent

[dead]

BoorishBears8 hours ago | root | parent

"Fable 5" is Opus 4.7, and the Opus 4.7 we got is a Sonnet sized model on a stronger base.

That's where all the regressions and inconsistency in experiences stem from: RL can still only go so far vs having more parameters

OtomotO6 hours ago | root | parent

Lol. If you're doing anything non trivial that's not a CRUD webapp but e.g. some physics simulation or high performance GPU code any and all models I've tried suck.

They are not just leagues behind what experts would code, they are not even playing the same game.

Which is to be expected, as there isn't so much physics or high performance gpu code available as there is for your typical CRUD API and JS frontend.


gruez8 hours ago | parent | next

I don't get it, your complaint is that they have catchy names rather than dry names like GPT-5.6? Does OpenAI hype their models less?

Aperocky8 hours ago | root | parent

Oh, Far less.

It's getting to a point that it's offputting, and the next step would be to put it into "untrusted" bucket. Opus 4.7 already burned their credibility once, 2 more strikes remain.

aenis7 hours ago | parent | next

Not my impression. I felt 4.7 was a regression, but I am again badly in love with 4.8, with the level of insight it produces in design discussions and how long it can go unattended while producing spec-adhering, quality code. There are problems it still can't solve well, at the edges of algorithmics and far from the mainstream, but for lots of stuff it is godlike.

Also, I don't think Boris C. is coming here for PR. He is a tech guy, and this is the best place for tech discussions. Why so cynical? The guy is an engineer.

jwpapi8 hours ago | parent | next

I don’t even think that Boris is really just one person. He apparently vibe coded Claude Code and is responding on Threads, Twitter, HN and everywhere.

guybedo7 hours ago | parent | next

They're good at marketing, but my first subjective assessment of Fable is that it's really smart.

I've been working with gpt 5.5 and opus 4.8 quite a lot, and interacting with Fable feels like a smart guy just entered the room.

boc3 hours ago | root | parent

Yeah idk what people are talking about- it's not marketing. This thing is substantially better than opus 4.8/gpt5.5 from what I'm seeing today.

avaer8 hours ago | parent | next

If you truly believe this, you've discovered a superpower over everyone else in the industry.

While everyone else is wasting time and money on the slower, more expensive models, you've found a way to outpace everyone for less money. Everyone else is wrong and you will get rich.

(I don't actually believe the premise is true, I'm just pointing out the logical conclusion to what you're saying so maybe we can reconsider the premise)

xyzsparetimexyz7 hours ago | root | parent

That's not how costs work. You don't get rich off buying a €10 hammer that's the same quality as someone's €50 hammer.

iillexial6 hours ago | parent | next

>Hey! Boris from the Claude Code team!

>TOP 5 METHODS FROM BORIS ON HOW TO SPEND MORE MONEY ON TOKENS

>Boris from Claude just told he doesn't prompt anymore. He LOOPS instead

>"chatgpt has gotten soooo much better with the latest update."

>"codex is the best AI coding product and we want to make it easy to try."

Karpathy about Fable 5:

>"You can give it a lot more ambitious tasks than what you're used to, the model "gets it""

Sam Altman about gpt-5.4:

>In my experience, it "gets what to do"

What a time to be alive. Models are great, but all the slop, marketing, and fakeness around them is just unbearable.

atleastoptimal7 hours ago | parent | next

> At this point Anthropic is a pure marketing and PR company. Super catchy names like Opus, Mythos and Fable trying to get you to think that these software products are actually super-human

Lol anti-AI bias on HN is crazy. Simply giving your product a quirky name is now being considered manipulative advertising. Is just doing normal PR and marketing something AI companies aren't allowed to do?

ausbah7 hours ago | root | parent

When they keep saying “oooh this new model is too big and crazy and totally can’t be released” or “this new model is a 10x game changer totally unlike our previous iterations”, it feels sort of like the boy crying wolf. Yes, they’re still pretty clearly improving models, but when you’ve hit diminishing returns / more incremental gains and you’re still saying this, it sounds like pure PR hype from a company that had previously been the “honest good guys” in the room.

atleastoptimal7 hours ago | root | parent

Their model did find thousands of security vulnerabilities across the companies they previewed Mythos with via project Glasswing. Is it not sensible that, given that emergent level of capability, that they do this gated release structure, as all those vulnerabilities would be exploitable by anyone using a Mythos-level model?

thefreeman8 hours ago | parent | next

How can you make this comment before even having a chance to try the new major model revision?

piyuv8 hours ago | parent | next

Current AI hype is built on marketing and PR, not capabilities, and has been from the start.

I still remember Sam Altman “begging AI to be regulated” and AGI being “some thousand days away”.

Breed faster horses and hope one will birth a locomotive.

WarmWash5 hours ago | parent | next

Don't forget the DoD stint that gave them this recent public boost.

Defying standard DoD precedent going back forever, which every other country has some form of too, and championing it as if they were some kind of moral freedom fighters.

Like selling the DoD guns and telling them they can only shoot bad guys with those guns, and that you will be the one to decide who counts as a bad guy...

xpct8 hours ago | parent | next

Indeed, hearing "Mythos-class model" felt very icky to me.

b3kart8 hours ago | root | parent

https://en.wikipedia.org/wiki/Typhoon-class_submarine vibes

reasonableklout8 hours ago | parent | next

I think this says more about your type of work than anything. For bugfinding/incident response in distributed systems - which often involves extensive use of Datadog/Sentry MCPs and poring over heaps of logs in addition to reading tons of code - 4.8 has been significantly better than 4.6.

nozzlegear7 hours ago | root | parent

> Sentry MCPs

Oops, time to reauthenticate for the 10th time!

system2 8 hours ago | parent | next

You are right; all I noticed was a big-time slowdown. They increased the quota, but I cannot even reach the end of the day with these speeds. .NET coding somehow improved, though.

MattGaiser8 hours ago | parent | next

Doesn't this suggest your use case is simply insufficiently complicated?

mawadev7 hours ago | parent | next

When the AI overlord is descending into pleb space to say hi, you know stuff is real.

MagicMoonlight8 hours ago | parent | next

[dead]

chis7 hours ago | parent

Hackernews not blindly hate on AI challenge: impossible
