I've tried throwing unsupervised agentic software factory workflows against the wall, and they burned through my tokens like nobody's business but didn't produce much.
Supervised, human-in-the-loop process on the other hand is much more productive but doesn't consume nearly as much. Maybe that's why everyone's pushing agentic approaches so much.
Dealt with that by going all out and making an agentic parallel code review skill. Basically an infinite TODO list generator. Now I'm definitely getting 100% of the usage I paid for. It really burns tokens like nobody's business, and catches a lot of issues while at it. I've been looping this review/fix process every week. It's dramatically reduced the amount of stuff I need to pay attention to during my human review sessions.
https://www.amazon.com/Passion-Lubes-Natural-Water-Based-Lub...
> This product is out of stock
Ah, shoot, there go my weekend plans. Bummer.
Isn't this a (mildly exaggerated) description of AWS, which is a very successful service?
So your costs scale with the number of users you have.
Thats an op ex that you can explain.
For tokens for developers its maybe closer, cost/outcome wise, to hiring an external consulting company to write your code; money paid scales with work done, no promise of delivery, arbitrary unpredictable external price changes.
Its not quite the same; though, similarly lucrative for consultants.
Not if you're using it for running builds, running research jobs, model training, etc.
Like the other commenter said: cloud spend can also spin out of control if you don't pay attention, yet we've found ways to keep it under control (training, guardrails, limits, transparancy).
Personally, this feels like its just trying to push the work of managers in allocating resources onto developers so that they have more work to do and can be blamed if anything goes wrong.
Yes, but in a "oops this is gonna take another two months to finish" kind of way, not the "oops this is the 12th time this month 8 developers have burned $2K in tokens in a single day and no one really knows how it happened" kind of way.
people still can't get over the unreasonable effectiveness of algorithms.
I get the anti/skeptic sentiment. I've been called a lot of horrible things by a vocal contingent when they hear that I help train folks to learn software engineering best practices and then apply AI to that.
Actually it’s really its own thing, I don’t think the slot machine analogy works too well, you also have fixed odds (and you know they aren’t in your favor), and a binary output
Sidenote but I hope everyone realizes that 100 is kind of arbitrary here and does not mean the total chance to to get something is 100%.
If this is the "analogy" you go for, you don't seem to be suited to make that comparison.
Colleague used Sonnet 4.6 on some pretty normal agentic coding tasks through AWS Bedrock to keep the data in the EU, 100 EUR usage in a single day. In comparison, the Mistral subscription costs about 20 EUR per month and we tested that for similar tasks it was okay, the usage got to around 10% of that monthly limit in a single day. Or Anthropic's own Max (5x) plan where you get way, way more tokens to do with as you please.
I feel like the sweet spot is having a monthly subscription with any of the providers (you're subsidized a bunch), but if you have to pay per tokens, now I'd just look in the direction of what tasks DeepSeek would be okay for, sadly probably not in the situation above. For a startup, though...
On the other hand, this feels a bit hypocritical:
> It was part of an effort to get project managers, designers, and other employees to experiment with coding for the first time, and sources tell me that Claude Code has proved very popular inside Microsoft over the past six months.
They're gonna say that the future is all AI... until they get the bill.
I upgraded my plan last night to Mistral Le Chat Teams. This now costs me €60 per month for two users. Limits have been reset, but I have no idea now if my per seat limit is higher than the Pro plan, or if the limit is shared between the seats, it’s really not clear. I guess I will find out next month. The limits reset on the first of the month and I really hope I don’t hit them in the next seven days.
I use Mistral Vibe CLI and I’ve written and implemented a couple of new skills[1]. Caveman, based on an idea I found online somewhere, this skill removes all extraneous response text, including articles. Makes for some fun reading, but supposedly reduces output tokens significantly. Hash-anchors, this one is based on a concept from Dirac[2], reduces search failures and also includes multi-file dispatch. It will be hard to measure, but Vibe tells me these two should result in roughly a 40% reduction in token burn.
The results for a function implementation and test of levenshtein distance in js are pretty similar but Mistral is 30x cheaper than Opus 4.7 and 4x faster than Sonnet 4.6.
Levenshtein distance is not only a well-understood problem, it's small, self-contained, and extremely well-represented in the training data. The kind of problem where even small/bad models can excel. The golden standard for those tasks is just "use a library" so no wonder the beefy models are expensive: you're chartering a commercial airplane to go grocery shopping.
My personal benchmarks are software engineering tasks (ideally spanning multiple packages in a monorepo) composed of many small decisions that, compounded, make or break the implementation and long-term maintainability.
There's where even frontier models struggle, which makes comparisons meaningful.
It’s making guesses not decisions, framing as decisions will lead you astray to wasted time and tokens.
It’s vaguely productive to tell them a ton of relevant info upfront attempting to minimise their need for load bearing guesses. I say vaguely because obedience is generally only around the level where it's good enough to lull you into a false sense of security, not to actually be obedient.
It’s a bit more productive to use the various loop mechanisms (hooks, /goal etc) to evaluate each end of turn against guard rails and reject with clear instruction on whats unacceptable. Obviously if you only do this without the front load of info then you’re likely to spend more tokens to reach a satisfactory end of iteration.
Breaking code up into composable chunks has worked well for me over 50+ years as a professional software developer, and I can't get away from the idea that it is still usually the way to go using agentic coding tools.
I mean, the will continue to say so, they just want to be the ones being paid for the service, not anthropic :)
I tend to work with the agent, and observe what's going on as well as review/test and work through results/changes. I spend a lot more time planning tasks/features than the execution, even using the agent as part of planning and pre-documentation. It works really well. I don't think people burning through the 5hr allotment in under an hour are actually reviewing/QC/QA the results of what they're doing in any meaningful way, and likely producing as much garbage as good (slop).
I'm really curious as to HOW the MS employees were using the agents as much as what they were doing.
By buying a subscription and dealing with the limits, using claude code and paying per token seems like the fast lane to the poor house.
Me: We need to do this this that.
Claude: <random stuff that approximates human outout>
Me: Are you sure?
Claude: Well actually there is a bug <more random stuff that looks right this time>
----- Now it is:
Me: We need to do this this that.
Claude: <random stuff that approximates human outout>
Claude: Let me consult the advisor on that.
Claude: advisor came up with some advice, adjusting according to that. <more random stuff that looks right this time>