LLMs work best when the user defines their acceptance criteria first

https://blog.katanaquant.com/p/your-llm-doesnt-write-correct-code

405dnw | 20 hours ago | 355 | HN

Their default solution is to keep digging. It has a compounding effect of generating more and more code.

If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

If you tell them the code is slow, they'll try to add optimized fast paths (more code), specialized routines (more code), custom data structures (even more code). And then add fractally more code to patch up all the problems that code has created.

If you complain it's buggy, you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

unlikelytomato18 hours ago | parent | next

This is why I'm confused when people say it isn't ready to replace most of the programmer workforce.

Foobar856815 hours ago | root | parent | next

LLM code is higher quality than any codes I have seen in my 20 years in F500. So yeah you need to "guide" it, and ensure that it will not bypass all the security guidance for ex...But at least you are in control, although the cognitive load is much higher as well than just "blind trust of what is delivered".

But I can see the carnage with offshoring+LLM, or "most employees", including so call software engineer + LLM.

loading story #47286699

loading story #47285124

loading story #47288319

loading story #47285412

queenkjuul6 hours ago | root | parent

You've worked at some shitty places. Nothing I've seen from Claude matches even my worst coworker (and my last job was an F500)

loading story #47284795

loading story #47284929

loading story #47284796

loading story #47287558

loading story #47285608

loading story #47283868

loading story #47284124

loading story #47283958

loading story #47289277

loading story #47285271

Implicated17 hours ago | parent | next

Not trying to be snarky, with all due respect... this is a skill issue.

It's a tool. It's a wildly effective and capable tool. I don't know how or why I have such a wildly different experience than so many that describe their experiences in a similar manner... but... nearly every time I come to the same conclusion that the input determines the output.

> If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

Yes, when the prompt/instructions are overly broad and there's no set of guardrails or guidelines that indicate how things should be done... this will happen. If you're not using planning mode, skill issue. You have to get all this stuff wrapped up and sorted before the implementation begins. If the implementation ends up being done in a "not-so-great" approach - that's on you.

> If you tell them the code is slow

Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results? You ask it to benchmark the code and then you ask it how it might be optimized. Then you discuss those options with it (this is where you do the part from the previous paragraph, where you direct the approach so it doesn't do "no-so-great approach") until you get to a point where you like the approach and the model has shown it understands what's going on.

Then you accept the plan and let the model start work. At this point you should have essentially directed the approach and ensured that it's not doing anything stupid. It will then just execute, it'll stay within the parameters/bounds of the plan you established (unless you take it off the rails with a bunch of open ended feedback like telling it that it's buggy instead of being specific about bugs and how you expect them to be resolved).

> you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

This is an area I will agree that the models are wildly inept. Someone needs to study what it is about tests and testing environments and mocking things that just makes these things go off the rails. The solution to this is the same as the solution to the issue of it keeping digging or chasing it's tail in circles... Early in the prompt/conversation/message that sets the approach/intent/task you state your expectations for the final result. Define the output early, then describe/provide context/etc. The earlier in the prompt/conversation the "requirements" are set the more sticky they'll be.

And this is exactly the same for the tests. Either write your own tests and have the models build the feature from the test or have the model build the tests first as part of the planned output and then fill in the functionality from the pre-defined test. Be very specific about how your testing system/environment is setup and any time you run into an issue testing related have the model make a note about that and the solution in a TESTING.md document. In your AGENTS.md or CLAUDE.md or whatever indicate that if the model is working with tests it should refer to the TESTING.md document for notes about the testing setup.

Personally, I focus on the functionality, get things integrated and working to the point I'm ready to push it to a staging or production (yolo) environment and _then_ have the model analyze that working system/solution/feature/whatever and write tests. Generally my notes on the testing environment to the model are something along the lines of a paragraph describing the basic testing flow/process/framework in use and how I'd like things to work.

The more you stick to convention the better off you'll be. And use planning mode.

riffraff15 hours ago | root | parent | next

> Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results?

Yes? Why don't you?

They are capable people that just didn't notice something, id I notice some telemetry and tell them "hey this is slow" they are expected to understand the reason(s).

loading story #47285329

bryanrasmussen14 hours ago | root | parent | next

Yeah if my co-worker can't start figuring out why the code is slow, with a reasonable reference to what the code in question is, that is a knock against their skills. I would actually expect some ideas as to what the problem is just off the top of their heads, but that the coding agent can't do that isn't a hit against it specifically, this is now a good part of what needs to be done differently.

The suggestion to tell the agent to do performance analysis of the part of the code you think is problematic, and offer suggestions for improvements seems like the proper way to talk to a machine, whereas "hey your code is slow" feels like the proper way to talk to a human.

loading story #47285848

loading story #47285667

girvo13 hours ago | root | parent | next

I absolutely tell a coworker their code is slow and expect them to fix it…

loading story #47285769

loading story #47285078

loading story #47285483

loading story #47284025

loading story #47288779

loading story #47284672

loading story #47288312

loading story #47286733

loading story #47287526

loading story #47286205

loading story #47285315

loading story #47286966

loading story #47284399

loading story #47284122

treetalker6 hours ago | parent | next

This is my experience with how LLMs "draft" legal arguments: at first glance, it's plausible — but may be, and often is, invalid, unsound, and/or ill-advised.

The catch is that many judges lack the time, energy, or willingness to not only read the documents in detail, but also roll up their sleeves and dig into the arguments and cited authorities. (Some lack the skills, but those are extreme cases.) So the plausible argument (improperly and unfortunately) carries the day.

LLM use in litigation drafting is thus akin to insurgent/guerilla warfare: it take little time, energy, or thinking to create, yet orders of magnitude more to analyze and refute. (It's a species of Brandolini's Law / The Bullshit Asymmetry Principle.) Thus justice suffers.

I imagine that this is analogous to the cognitive, technical, and "sub-optimal code" debt that LLM-produced code is generating and foisting upon future developers who will have to unravel it.

loading story #47288755

loading story #47288777

loading story #47288750

grey-area13 hours ago | parent | next

This is a fascinating look into code generated by an LLM that is correct in one sense (passes tests) but doesn't meet requirements (painfully slow). Doesn't use is_ipk to identify primary keys, uses fsync on every statement. The problem with larger projects like this even if you are competent is that there are just too many lines of code to read it properly and understand it all. Bravo to the author for taking the time to read this project, most people never will (clearly including the author of it).

I find LLMs at present work best as autocomplete -

The chunks of code are small and can be carefully reviewed at the point of writing

Claude normally gets it right (though sometimes horribly wrong) - this is easier to catch in autocomplete

That way they mostly work as designed and the burden on humans is completely manageable, plus you end up with a good understanding of the code generated. They make mistakes I'd say 30% of the time or so when autocompleting, which is significant (mistakes not necessarily being bugs but ugly code, slow code, duplicate code or incorrect code.

Having the AI produce the majority of the code (in chats or with agents) takes lots of time to plan and babysit, and is harder to review, maintain and diagnose; it doesn't seem like much of a performance boost, unless you're producing code that is already in the training data and just want to ignore the licensing of the original code.

D-Machine17 hours ago | parent | next

This article is great. And the blog-article headline is interesting, but wrong. LLM's don't in general write plausible code (as a rule) either.

They just write code that is (semantically) similar to code (clusters) seen in its training data, and which haven't been fenced off by RLHF / RLVR.

This isn't that hard to remember, and is a correct enough simplification of what generative LLMs actually do, without resorting to simplistic or incorrect metaphors.

loading story #47285691

loading story #47284537

consumer45113 hours ago | parent | next

Nitpick/question: the "LLM" is what you get via raw API call, correct?

If you are using an LLM via a harness like claude.ai, chatgpt.com, Claude Code, Windsurf, Cursor, Excel Claude plug-in, etc... then you are not using an LLM, you are using something more, correct?

An example I keep hearing is "LLMs have no memory/understanding of time so ___" - but, agents have various levels of memory.

I keep trying to explain this in meetings, and in rando comments. If I am not way off-base here, then what should be the term, or terms, be? LLM-based agents?

loading story #47285851

loading story #47287489

loading story #47285806

seanmcdirmid6 hours ago | parent | next

Ok, I’ll bite: how is that different from humans?

strken6 hours ago | parent | next

Human behaviour is goal-directed because humans have executive function. When you turn off executive function by going to sleep, your brain will spit out dreams. Dream logic is famous for being plausible but unhinged.

I have the feeling that LLMs are effectively running on dream logic, and everything we've done to make them reason properly is insufficient to bring them up to human level.

seanmcdirmid5 hours ago | root | parent | next

Isn’t a modern LLM with thinking tokens fairly goal directed? But yes, we hallucinate in our sleep while LLMs will hallucinate details if the prompt isn’t grounded enough.

zarzavat5 hours ago | root | parent | next

The thing about dream logic is that it can be a completely rational series of steps, but there's usually a giant plot hole which you only realise the second you wake up.

This definitely matches my experience of talking to AI agents and chatbots. They can be extremely knowledgeable on arcane matters yet need to have obvious (to humans) assumptions pointed out to them, since they only have book smarts and not street smarts.

tovej5 hours ago | root | parent

Assuming this is not a rhetorical question: no, it is not. The only "goal" is to maximize plausibility.

satvikpendem6 hours ago | root | parent | next

A prompt for an LLM is also a goal direction and it'll produce code towards that goal. In the end, it's the human directing it, and the AI is a tool whose code needs review, same as it always has been.

whoamii6 hours ago | root | parent | next

Some of my best code comes from my dreams though.

tsunamifury6 hours ago | root | parent | next

It’s amazing how much you get wrong here. As LLM attention layers are stacked goal functions.

What they lack is multi turn long walk goal functions — which is being solved to some degree by agents.

nemo44x6 hours ago | root | parent | next

LLMs are literally goal machines. It’s all they do. So it’s important that you input specific goals for them to work towards. It’s also why logically you want to break the problem into many small problems with concrete goals.

andai6 hours ago | root | parent

Do you only mean instruct-tuned LLMs? Or the base (pretrained) model too?

spiderfarmer6 hours ago | root | parent

And yet LLM’s are incredibly useful as they are right now.

loading story #47289033

wood_spirit6 hours ago | parent | next

It’s not. LLMs are just averaging their internet snapshot, after all.

But people want an AI that is objective and right. HN is where people who know the distinction hang out, but it’s not what the layperson things they are getting when they use this miraculous super hyped tool that everybody is raving about?

mrwh6 hours ago | root | parent | next

The etiquette, even at the bigtech place I work, has changed so quickly. The idea that it would be _embarrassing_ to send a code review with obvious or even subtle errors is disappearing. More work is being put on the reviewer. Which might even be fine if we made the further change that _credit goes to the reviewer_. But if anything we're heading in the opposite direction, lines of code pumped out as the criterion of success. It's like a car company that touts how _much_ gas its cars use, not how little.

wood_spirit5 hours ago | root | parent

Review is usually delegated to an AI too

satvikpendem6 hours ago | root | parent | next

By now, a few years after ChatGPT released, I don't think anyone is thinking AI is objective and right, all users have seen at least one instance of hallucination and simply being wrong.

wood_spirit6 hours ago | root | parent

Sorry I can think of so many counter examples. I also detect a lot of “well it hallucinates about subject X (that the person knows well, so can spot the hallucination)” but continue to trust it on subjects Y and Z (which the person knows less well so can’t spot the hallucinations).

YMMV.

andai6 hours ago | root | parent | next

> Briefly stated, the Gell-Mann Amnesia effect works as follows. You open the newspaper to an article on some subject you know well. In Murray's case, physics. In mine, show business. You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward-reversing cause and effect. I call these the "wet streets cause rain" stories. Paper's full of them. In any case, you read with exasperation or amusement the multiple errors in a story-and then turn the page to national or international affairs, and read with renewed interest as if the rest of the newspaper was somehow more accurate about far-off Palestine than it was about the story you just read. You turn the page, and forget what you know.

-Michael Crichton

satvikpendem6 hours ago | root | parent

Sure, Gell-Mann amnesia exists, but remember that its origin is actually human, in the form of newspaper writers. So, how can we trust humans the same way? In just the same way, AI cannot also be fully trusted.

wood_spirit5 hours ago | root | parent

The current way of doing AI cannot be trusted.

that doesn’t mean the future won’t herald a way of using what a transformer is good at - interfacing with humans - to translate to and interact with something that can be a lot more sound and objective.

seanmcdirmid5 hours ago | root | parent

There are a lot of binary thinkers on HN, but they shouldn’t make up a majority.

rDr4g0n6 hours ago | parent | next

It's much easier to fire an employee which produces low quality/effort work than to convince leadership to fire Claude.

loading story #47288889

apical_dendrite6 hours ago | parent | next

The volume is different. Someone submitted a PR this week that was 3800 lines of shell script. Most of it was crap and none of it should have been in shell script. He's submitting PRs with thousands of lines of code every day. He has no idea how any of it actually works, and it completely overwhelms my ability to review.

Sure, he could have submitted a ill-considered 3800 line PR five years ago, but it would have taken him at least a week and there probably would have been opportunities to submit smaller chunks along the way or discuss the approach.

switchbak5 hours ago | root | parent | next

It’s harder when the person doing what you describe has the ability to have you fired. Power asymmetry + irresponsible AI use + no accountability = a recipe for a code base going right to hell in a few months.

I think we’re going to see a lot of the systems we depend on fail a lot more often. You’d often see an ATM or flight staus screen have a BSOD - I think we’re going to see that kind of thing everywhere soon.

satvikpendem6 hours ago | root | parent

Just block that user, that seems to be the way.

somewhereoutth6 hours ago | parent

Humans have a 'world model' beyond the syntax - for code, an idea of what the code should do and how it does it. Of course, some humans are better than others at this, they are recognized as good programmers.

satvikpendem6 hours ago | root | parent

Papers show that AI also has a world model, so I don't think that's the right distinction.

tovej5 hours ago | root | parent

Could you please cite these papers. If by AI you mean LLMs, that is not supported by what I know. If you mean a theoretical world-model-based AI, that's just a tautological statement.

loading story #47288993

loading story #47286049

andai6 hours ago | parent | next

It writes statistically represented code, which is why (unless instructed otherwise) everything defaults to enterprisey, OOP, "I installed 10 trendy dependencies, please hire me" type code.

loading story #47290460

loading story #47290224

loading story #47287715

comex19 hours ago | parent | next

Based on a search, the SQLite reimplementation in question is Frankensqlite, featured on Hacker News a few days ago (but flagged):

https://news.ycombinator.com/item?id=47176209

flerchin20 hours ago | parent | next

Yes plausible text prediction is exactly what it is. However, I wonder if the author included benchmarking in their prompt. It's not exactly fair to keep hidden requirements.

loading story #47283775

loading story #47290171

loading story #47286668

loading story #47289295

loading story #47290235

loading story #47288768

loading story #47288257

loading story #47289804

lukeify19 hours ago | parent | next

Most humans also write plausible code.

loading story #47283816

loading story #47283740

jqpabc12319 hours ago | parent | next

LLMs have no idea what "correct" means.

Anything they happen to get "correct" is the result of probability applied to their large training database.

Being wrong will always be not only possible but also likely any time you ask for something that is not well represented in it's training data. The user has no way to know if this is the case so they are basically flying blind and hoping for the best.

Relying on an LLM for anything "serious" is a liability issue waiting to happen.

loading story #47283994

loading story #47283948

loading story #47283993

loading story #47284776

loading story #47288058

loading story #47284629

loading story #47284983

loading story #47286127

marginalia_nu20 hours ago | parent | next

I tried to make Claude Code, Sonnet 4.6, write a program that draws a fleur-de-lis.

No exaggeration it floundered for an hour before it started to look right.

It's really not good at tasks it has not seen before.

loading story #47283777

loading story #47283767

loading story #47283724

loading story #47283934

loading story #47283679

loading story #47287828

raw_anon_111118 hours ago | parent | next

The difference for me recently

Write a lambda that takes an S3 PUT event and inserts the rows of a comma separated file into a Postgres database.

Naive implementation: download the file from s3 and do a bulk insert - it would have taken 20 minutes and what Claude did at first.

I had to tell it to use the AWS sql extension to Postgres that will load a file directly from S3 into a table. It took 20 seconds.

I treat coding agents like junior developers.

loading story #47284389

loading story #47284221

loading story #47285486

codethief18 hours ago | parent | next

> Your LLM Doesn't Write Correct Code. It Writes Plausible Code.

I don't always write correct code, either. My code sure as hell is plausible but it might still contain subtle bugs every now and then.

In other words: 100% correctness was never the bar LLMs need to pass. They just need to come close enough.

skybrian19 hours ago | parent | next

You can ask an LLM to write benchmarks and to make the code faster. It will find and fix simple performance issues - the low-hanging fruit. If you want it to do better, you can give it better tools and more guidance.

It's probably a good idea to improve your test suite first, to preserve correctness.

ontouchstart19 hours ago | parent | next

I made a comment in another thread about my acceptance criteria

https://news.ycombinator.com/item?id=47280645

It is more about LLMs helping me understand the problem than giving me over engineered cookie cutter solutions.

19 hours ago | parent | next

{"deleted":true,"id":47283785,"parent":47283337,"time":1772850021,"type":"comment"}

loading story #47284774

graphememes18 hours ago | parent | next

bad input > bad output

idk what to say, just because it's rust doesn't mean it's performant, or that you asked for it to be performant.

yes, llms can produce bad code, they can also produce good code, just like people

FrankWilhoit19 hours ago | parent | next

Enterprise customers don't buy correct code, they buy plausible code.

loading story #47283746

loading story #47284056

loading story #47283762

loading story #47285178

bitwize6 hours ago | parent | next

You: Claude, do you know how to program?

Claude: No, but if you hum a few bars I can fake it!

Except "faking it" turns out to be good enough, especially if you can fake it at speed and get feedback as to whether it works. You can then just hillclimb your way to an acceptable solution.

loading story #47288821

loading story #47285758

loading story #47284741

gzread19 hours ago | parent | next

Early LLMs would do better at a task if you prefixed the task with "You are an expert [task doer]"

loading story #47289146

mmaunder19 hours ago | parent | next

But my AI didn't do what your AI did.

Cherry picked AI fail for upvotes. Which you’ll get plenty of here an on Reddit from those too lazy to go and take a look for themselves.

Using Codex or Claude to write and optimize high performance code is a game changer. Try optimizing cuda using nsys, for example. It’ll blow your lazy little brain.

loading story #47284014

loading story #47283789

loading story #47284844

loading story #47289103

18 hours ago | parent | next

{"deleted":true,"id":47284371,"parent":47283337,"time":1772856330,"type":"comment"}

loading story #47289012

loading story #47286286

loading story #47287040

loading story #47285261

loading story #47284898

loading story #47285144

loading story #47285676

cat_plus_plus19 hours ago | parent | next

That's very impressive. Your LLM actually wrote a correct code for a full relational database on the first try, like it takes 2.5 seconds to insert 100 rows but it stores them correctly and select is pretty fast. How many humans can do this without a week of debugging? I would suggest you install some profiling tools and ask it to find and address hotspots. SQL Lite had how long and how many people to get to where it is?

loading story #47283780

loading story #47288206

shablulman6 hours ago | parent | next

[dead]

loading story #47284785

loading story #47285289

loading story #47285709

thisguySPED19 hours ago | parent | next

[flagged]

thisguySPED19 hours ago | parent | next

[flagged]

loading story #47284572

satvikpendem6 hours ago | parent | next

Oftentimes, plausible code is good enough, hence why people keep using AI to generate code. This is a distinction without a difference.

loading story #47288807

loading story #47288833

serious_angel19 hours ago | parent

Holy gracious sakes... Of course... Thank you... thank you... dear katanaquant, from the depths... of my heart... There's still belief in accountability... in fun... in value... in effort... in purpose... in human... in art...

- <http://archive.today/2026.03.07-020941/https://lr0.org/blog/...> (I'm not consulting an LLM...)

- <https://web.archive.org/web/20241021113145/https://slopwatch...>

#visit	13,002,299
#session	74,665
#live-session	0