I can recognize so much of the GPT/Codex generated code long after it gets merged (not by me).
Additionally, the time spent on every agent turn on GPT 5.5 is much longer compared to Claude Opus 4.8, which means iterating on the code takes a lot more patience, and there are a lot more nitpicks to pick when actually using GPT 5.5 to do software engineering.
Feels like GPT-style models are more geared toward one-shot software vibing (and handling the vibe-coded mixture) compared to Claude's focus on actual software maintenance. I got a GPT Pro sub for free and wanted to cancel my Claude subscription so much, but I still keep reaching for Claude models a lot more. Frustrating.
this is the line I keep in Agents.md that helps me prevent Codex from playing smart
When a "person" that you don't view as a "real" person repeatedly does exactly what you just told it not to do (often amid false assurances it understands and will avoid doing so in the future), most people get angry.
Compare it to how the kind of people who treat children like property treat their kids, or other examples of keeping people as property.
We were reviewing reports of situations where the models failed to follow directions, and there was a common thread in some of them: when the operator got the model to acknowledge the rule breach, it quoted back something that included swearing.
I don’t have the data to truly look into it, but I did give the instruction to my engineers to avoid it as a “might be a problem”.
But I avoid unnecessary emotion in my prompts because I don't want potentially distracting activations. Kind of like communicating with humans.
> impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.
Unless the mechanism is understood, my assumption is that this is a moving target.
https://www.anthropic.com/research/emotion-concepts-function
How so? Plenty of swearing in lots of training data, especially older code, e.g. in Linux.
Bonus points if you find yourself actually saying it out loud while typing it.
I have used the word "shenanigans" way more in a couple of years of agentic coding than in 30 years of writing code with humans.
ai llm are doing what i tell them to.
if you’re building something meaningful (in my case a platform used by many people across many companies) you want to ensure you
1. have actual systems engineering and architecture in mind that you want the models to
2. implement based on what you tell it to do
when i was just telling the models what i want done without doing due diligence it would go and do some moronic implementation that was awful. mid input = mid output
these days i just maintain specifications documents and the AI follows everything i tell it to in that document. so when i tell it to do one thing, the result is made following those architecture specs.
i have code that is single resp, modular, easy to extend and test.
i would ballpark 95% of the time i get what i asked for.
sometimes it tries to be clever in cases that weren’t covered in my arch specs. in those 5% of cases i go and update my specs.
source: used billions of tokens worth to build something actually in production across both mobile platforms and web, deployed on my own cloud infra. i use codex mainly. some claude.
But Claude models seem to be better at long term problems or more ambiguous problems.
I'm curious as to what the primary benefit is here. Are there secret improvements in training? There hasn't been much in fundamental model architecture, I don't think. What about harnesses? I wonder what's pushing the AI. It seems like harnesses have been the main thing pushing AI ever since CoT.
I think the end game is routed model usage and SLMs. I think Apple is going to prove this in the consumer space pretty handily and I'm curious how the Android ecosystem responds since the hardware is considerably lacking in model performance. I think Apple has a huge opportunity here, as much as I don't like their current ecosystem of walled garden. They did position themselves very well with ARM and custom chips for their hardware. Hopefully the broader ecosystem of ARM and Linux are able to make some headway and we see a more formalized, and broadly accepted, architecture to capitalize on.
I’m sure you could put something similar together with a bunch of duct tape and 2 weeks of effort, but it won’t work nearly as nicely nor out of the box. so…what am i missing?
My company has an agreement with the big providers and while i'm pretty sure they think about how to get budget back, it's a competitive advantage and normal people will not learn different model behaviours.
At least for now.
Regardless of what others are doing, US labs here are just rushing to IPO. It's NOT a sign of confidence.
It's the equivalent of saying you have confidence in SpaceX making revenue by renting out their data center (instead of their AI making bank).
On the same note, if SpaceX is doing datacenters on earth successfully, what's wrong with that? They rented cloud infra to a #2 or #3 provider in the world after < 2 years in business. It's a success, no?
If you get hired as a staff engineer and do the work of a junior, what's wrong with that?
Clearly xAI (now part of spaceX) did not raise funds to be a data center. The margins are way different. There are plenty of recent IPOs in that area that are worth at most billions not trillions.
> going to IPO is a sign of confidence , you need to report a lot of things, that private companies don't.
This isn't going to IPO. This is rushing to IPO. It is a sign of confidence that the market or wider environment might crash soon so we need the liquidity now.
> This is an exact reason chinese labs do not rush to go public.
Maybe or maybe not. If you are referring to Chinese labs - both the Hong Kong and China stock market are way weaker than Nasdaq. It's not comparable. Check all the recent Hong Kong IPOs that have tanked.
So no, reason not to might just be: no money in it.
There are huge numbers of users (myself included) that do have an exact idea of what inference costs are - on open models. We can buy tokens from 3rd parties that have no motivation to subsidize our use. That's to say, there's a fair marketplace[1] and we're hanging out there.
If you want to say "I don't think anyone has a firm grasp on actual inference costs on these proprietary/closed models", then I could agree with that.
China subsidizes strategic industries, and they have heavily done so with AI. And DeepSeek specifically has said they have no commercialization plans.
For example: https://www.boc.cn/aboutboc/bi1/202501/t20250123_25254674.ht...
It’s generally established that Anthropic/OpenAI are going for all-out performance with big VC dollars at the expense of efficiency, and China has geopolitically limited compute and an incentive to compete on value per dollar.
Why not? Hetzner charges WAY less than AWS too. Can you not believe that?
We know roughly how much these companies spend and what their revenues are. Based on that, they'd have to more than double revenue (without spending more money) just to stay even, and that's not good enough given how deep in the hole they are.
> OpenAI and Anthropic are heavily subsidizing their inference -- no wait, they are charging the most they can get away with before going public. Where is the truth?
Both are true. I mean, I'd be willing to spend a bit more than I do now, but not more than double, and neither are most companies. The company I work for is currently investigating how to reduce LLM spend, not looking to spend more.
Both. They are charging the most they can get away with and that amount is still heavily subsidized by VC capital.
Now that 200USD subscription starts to feel cheap...
I haven't gotten close to this either before, but now we wanted to move fast because this branch gets conflicts all the time and we want to get the migration over with asap.
And don't get me wrong. Opus did an absolutely horrible job at first, second and third round in this task. You really needed to steer it to get to the right solution.
And now Fable is out. And its first round of code reviews for this huge PR was definitely worth the money too...
Don't think that I'm just shrugging at that number. I see it every day, and I don't like that it's in the thousands now. But for people paying for the 100 or 200 dollar plans, I'm not super sure you will be able to use them in the future if the token price is in the thousands for a slightly bigger task...
If I were paying for this out of my own pocket, I'd definitely go with DeepSeek or local models and figure out how to make the best use of them.
IOW, you don't really think the value of this work is really worth $4k.
> why would I pay to do my job?
The question is: how long do you think your employer will be willing to pay for you and Anthropic, if you yourself said that if it were your money you'd put some time and effort into working with an open model?
I wonder what this question really means? Anthropic is useless if you don't know what to do with it. It's very useful if you do, and you can guide it to do the right things. Yes, it will for sure reduce the number of people we need to hire. But we are always looking for hires who know what they're doing and can utilize agents to be faster.
But if you think about how long an employer is willing to pay 10-20k per month per seat for Anthropic? I can't see this being feasible and it will have to end at some point.
It's worth it, and I can afford it, but I am not really the right type of user for token-based usage. It's all for personal and free work.
Unfortunately, that doesn't work within a single session. The K-V cache of a model is intertwined with the model's configuration. Switching models invalidates the cache, meaning everything up to the point of the switchover is processed like a new, uncached input token.
Per Anthropic's pricing doc, an Opus 4.8 cache hit costs 50¢/MTok, while Haiku costs $1/MTok for uncached input.
Model selection works best if sessions are short and self-contained, particularly if the first few interactions can reliably classify the model need. That probably covers most 'support chatbot' use-cases, but it doesn't describe the kinds of heavy agentic automation that really chews through token budgets.
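To make the arithmetic behind that pricing point concrete, here is a rough sketch using only the two rates quoted above (50¢/MTok for a cached Opus read, $1/MTok for uncached Haiku input). The function names and the 200k-token context are made up for illustration, and it deliberately ignores output tokens, cache-write charges, and any later turns where the smaller model's own cache would be warm.

```python
# Prices per million input tokens, as quoted in the comment above.
OPUS_CACHE_HIT_PER_MTOK = 0.50   # USD, cached input on the larger model
HAIKU_UNCACHED_PER_MTOK = 1.00   # USD, uncached input on the smaller model


def input_cost(context_tokens: int, price_per_mtok: float) -> float:
    """Cost in USD to process `context_tokens` input tokens at the given rate."""
    return context_tokens / 1_000_000 * price_per_mtok


def switch_penalty(context_tokens: int) -> float:
    """Extra USD paid on the first turn after switching models mid-session.

    Positive means re-reading the whole context uncached on the smaller
    model costs more, for that turn, than staying on the larger model
    with a warm cache.
    """
    stay = input_cost(context_tokens, OPUS_CACHE_HIT_PER_MTOK)
    switch = input_cost(context_tokens, HAIKU_UNCACHED_PER_MTOK)
    return switch - stay


if __name__ == "__main__":
    # Hypothetical session that has accumulated 200k tokens of context.
    print(f"{switch_penalty(200_000):.2f} USD extra on the switchover turn")
```

Under these assumptions the switchover turn costs twice as much on input as staying put, which is why routing seems to pay off mainly when the model is chosen before much context has built up.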
I don't think this is true if you simply quantize the model or run it with fewer active experts? The underlying weights would stay the same. You could also play further tricks with skipping some of the model's middle layers outright, which works surprisingly well due to how skip connections are used.
Most AI companies are just testing the waters with paid tiers right now, their greatest fear with increased pricing is folks reverting back to wikipedia, stack-overflow and other public domain organic activity buzzing back to life; that will kill any RoI potential in LLMs forever. They're playing the wait game instead, observing how the digital sphere reacts to every little increase in price.
If that weren't the case, they'd be pricing at lucrative premiums already, and they'd even have gotten away with it in the short term considering the increased dependency in the enterprise world. But that'd be like killing the goose that lays the golden eggs too soon and losing all long-term potential.
Once the folks are so addicted to LLMs that even writing a hello world program sounds like a nightmare and coming up with an article draft feels like reinventing Egyptian glyphs, that's when the real pricing hammer will come.
Anthropic wanting to switch billing to API rates is them just wanting to generate more profit.
Even if subscriptions are locally profitable (i.e., the cost of the subscription covers the cost of inference), they're still subsidized because they don't cover training and running the company; otherwise, these companies would be profitable.
Take a look at China for example - they have no access to NVIDIA, so they're trying to build their own hardware, they have no unlimited funding, so they try to optimize things.
And Anthropic is complete opposite of that - if NVIDIA were to triple their prices tomorrow, Anthropic would still pay them.
In the end, either we all somehow go mad and start paying Anthropic tens of thousands of dollars per month to support this madness, or we will go with whoever isn't lighting cash on fire.
Not true. Stop following US media spam if needed.
1. Very recently, the US did close a loophole on sanctions that allowed Chinese companies to use NVIDIA hardware outside of China i.e. before that was closed they all had access. The trick was train outside, do adjustments, ship the disks back and use non-NVIDIA in China, but at least the training and endpoints not hosted in China could all use NVIDIA.
2. There's been plenty of reports including fines and bans e.g. to Supermicro on smuggling NVIDIA hardware to China. I doubt it has been stopped. You can't catch everyone.
Granted, it could still mean that Anthropic just chooses to lose money - but that's Anthropic's choice.
DeepSeek has proven that inference can be much, much cheaper than what Anthropic advertises on their API rates page.
So they are profitable?
I think you are mismatching accounting terms.
You can't say the 'subscriptions' are profitable without accounting for the cost of making the model that is the source of the subscription.
They are heavily subsidized by the shareholders. Investing, running at a loss, with hope of some future profitability.
If a saner factory can sell you the same tool at a fraction of the cost of a gold-plated factory, your choice is going to be obvious.
Having said that, I found the cloud dev environments slow to the point where I wasn’t sure if it had frozen, so I never looked back.
Though the day is coming when there’s no distinguishing, I’m sure.
Also, is it really a defense department when you're starting wars of aggression every 15 years or so?
Just like how changing Kennedy Center letterhead to Trump Kennedy Center for a year didn't actually legally rename it.
Once a case with sufficient standing got in front of a judge it reverted to the actual legal name on the basis that only Congress can change the statutorily defined name.
For an admin so obsessed with legal names instead of chosen ones, you’d think they’d be less hypocritical.
I'm doing basic web development here utilizing animejs. Nothing too complicated (mostly saving time doing the scaffolding, still write the bulk of animations manually).
Truly believe that American companies are going to get completely curb stomped by China due to greed, ineptitude, and violating the social contract.
Deepseek V4 Flash is surprisingly capable and insanely cheap. It takes so much to get the session cost up to $0.01.
I agree with you on pricing, but what do you mean by this?
Why aren't corporations doing more to help workers with childcare? Why aren't they doing more profit sharing with workers? Why aren't they encouraging unions or sectorial bargaining? Why isn't the government mandating any of this?
Americans very rarely benefit when US corporations do well. That needs to change. No one benefits if Meta continues making billions in profit every quarter while society suffers from isolation, depression, suicide, and scams from their services. Americans don't benefit if health insurance companies are making massive profits while they can't afford deductibles.
Our society has been set up to simply extract wealth in all facets of life. That's a sick society and it needs to change.
I'm not saying China does this better, in fact China has some of the worst worker rights out of all the industrialized countries; but at least American consumers would benefit from cheaper higher quality Chinese goods. The world would likely benefit too if America got off the cold war hype train that did nothing to benefit humanity outside of those making weapon systems.
The AI companies sure are a brilliant example of corporations needing to do more to help their employees pay for childcare.