Hacker News

How I write software with LLMs

https://www.stavros.io/posts/how-i-write-software-with-llms/
Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

The author uses different models for each role, which I get. But I run production agents on Opus daily and in my experience, if you give it good context and clear direction in a single conversation, the output is already solid. The ceremony of splitting into "architect" and "developer" feels like it gives you a sense of control and legibility, but I'm not convinced it catches errors that a single model wouldn't catch on its own with a good prompt.

There's a lot of cargo culting, but it's inevitable in a situation like this, where the truth is model-dependent and changing all the time, and people have founded companies on the premise that they can teach you how to use AI well.
I think the splitting makes sense: it lets you give more specific prompts and isolated context to different agents. The "architect" doesn't need the code style guide in its context; that could actually be misleading, containing information that drives it away from the architecture.
> what's the evidence

What’s the evidence for anything software engineers use? Tests, type checkers, syntax highlighting, IDEs, code review, pair programming, and so on.

In my experience, evidence for the efficacy of software engineering practices falls into two categories:

- the intuitions of developers, based in their experiences.

- scientific studies, which are unconvincing. Some are unconvincing because they attempt to measure the productivity of working software engineers, which is difficult; you have to rely on qualitative measures like manager evaluations or quantitative but meaningless measures like LOC or tickets closed. Others are unconvincing because they instead measure the practice against some well defined task (like a coding puzzle) that is totally unlike actual software engineering.

Evidence for this LLM pattern is the same. Some developers have an intuition it works better.

After "fully vibecoding" (i.e. I don't read the code) a few projects, the important aspect of this isn't so much the different agents, but the development process.

Ironically, it resembles waterfall much more so than agile, in that you spec everything (tech stack, packages, open questions, etc.) up front and then pass that spec to an implementation stage. From here you either iterate, or create a PR.

Even with agile, it's similar, in that you have some high-level customer need, pass that to the dev team, and then pass their output to QA.

What's the evidence? Admittedly anecdotal, as I'm not sure of any benchmarks that test this thoroughly, but in my experience this flow helps avoid the pitfall of slop that occurs when you let the agent run wild until it's "done."

"Done" is often subjective, and you can absolutely reach a done state just with vanilla codex/claude code.

Note: I don't use a hierarchy of agents, but my process follows a similar design/plan -> implement -> debug iteration flow.

One added benefit is that it lets you throw more tokens at the problem. That may even be the most impactful benefit.

The way context windows and LLMs work requires this: from my experience, no frontier model produces bug-free, error-free code on the first pass, no matter how much planning you do beforehand.

With three tiers, you spend your token and context budget in full across three phases: plan, implement, review.

If the feature is complex, you do multiple rounds of review, each from scratch.

It works.

> the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

There's a 63-page paper with a mathematical proof, if you're really into this.

https://arxiv.org/html/2601.03220v1

My takeaway: AI learns from real-world texts, and real-world corpora tend to reflect a role split of architect/developer/reviewer.

> Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

Using multiple agents in different roles seems like it'd guard against one model/agent going off the rails with a hallucination or something.

I have been using different models for the same role: asking (say) Gemini, then, if I don't like the answer, asking Claude, then telling each LLM what the other one said to see where it all ends up.

Well I was until the session limit for a week kicked in.

> produces better results than just... talking to one strong model in one session?

I think the author admits that it doesn't, doesn't realise it, and just goes on:

--- start quote ---

On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet

--- end quote ---

Evidence? My friend, most of the practices in this field are promoted and adopted based on hand-waving, feelings, and anecdata from influencers.

Maybe you should write and share your own article to counter this one.

I randomly clicked and scrolled through the source code of Stavrobot ("The largest thing I’ve built lately is an alternative to OpenClaw that focuses on security") [1], and that is not great code. I have not used any AI to write code yet but have considered trying it out. Is this the kind of code I should expect? Or, the other way around: does someone have an example of some non-trivial code, in size and complexity, written by an AI without babysitting, where the code is really good?

[1] https://github.com/skorokithakis/stavrobot

I would suggest not delegating the LLD (class/interface-level design) to the LLM. The clankers are super bad at it. They treat everything as a disposable script.

Also document some best practices in AGENT.md, or whatever it's called in your setup.

E.g.:

    * All imports must be added on top of the file, NEVER inside the function.
    * Do not swallow exceptions unless the scenario calls for fault tolerance.
    * All functions need to have type annotations for parameters and return types.
And so on.

I almost always define the class-level design myself. In some sense I use the LLM to fill in the blanks. The design is still mine.
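As a hedged illustration (the function and its name are mine, not from this thread), code following rules like the ones above might look like:

```python
# Sketch of a function obeying the AGENT.md-style rules: imports at
# the top of the file (never inside a function), full type
# annotations, and exceptions re-raised with context, not swallowed.
import json
from pathlib import Path


def load_config(path: Path) -> dict[str, str]:
    """Load a JSON config file, letting parse errors propagate."""
    try:
        return json.loads(path.read_text())
    except json.JSONDecodeError as err:
        # Don't swallow the exception: add context and re-raise.
        raise ValueError(f"invalid config at {path}: {err}") from err
```

The point of putting such rules in the agent file is that they are cheap for the model to follow mechanically but expensive for you to fix after the fact.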

From my experience, you kind of get what you ask for. If you don't ask for anything specific, it'll write as it sees fit. The more you involve yourself in the loop, the more you can get it to write according to your expectations. It also helps to give it a style guide of sorts that follows your preferred style.
Pine Town [1], the "whimsical infinite multiplayer canvas of a meadow", also looks like pure slop.

[1] https://pine.town/

It's the kind of code you should expect if you don't run a harness that includes review and refactoring stages.

It's by no means the best LLMs can do.

You can make it better by investing a lot of time playing around with the tooling so that it produces something more akin to what you're looking for.

Good luck convincing your boss that this ungodly amount of time spent messing around with your tooling, for an immeasurable improvement in your delivery, is time well spent, as opposed to spending that same time delivering results by hand.

> is this the kind of code I should expect?

Sadly yes. But it "works", for some definition of working. We all know it's going to be a maintenance nightmare given the gigantic amount of code and projects now being generated ad infinitum. As someone commented in this thread: it can one-shot an app showing restaurant locations on a map and put a green icon on the ones that are open. But don't expect good code, secure code, performant code, and certainly not maintainable code.

By definition, unless the AIs can maintain that code, nothing is maintainable anymore: the reason being the sheer volume. Humans who could properly review and maintain code (and that's not many) are already outnumbered.

And as more and more become "prompt engineers" and are convinced that there's no need to learn anything anymore besides becoming a prompt engineer, the amount of generated code is only going to grow exponentially.

So to me it is the kind of code you should expect. It's not perfect. But it more or less works. And thankfully it shouldn't get worse with future models.

What we now need is tools, tools, and more tools to help keep these things on track. If we are ever to get some peace of mind about the correctness of this unreviewable generated code, we'll need to automate things like theorem provers and code coverage (which are still nowhere to be seen).

And just like all these models are running on Linux and QEMU and Docker (dev containers) and heavily using projects like ripgrep (Claude Code insists on having ripgrep installed), I'm pretty sure all the tools these models rely on, and shall rely on, to produce acceptable results are going to be very mostly written by humans.

I don't know how to put it nicely: an app showing green icon next to open restaurants on a map ain't exactly software to help lift off a rocket or to pilot a MRI machine.

BTW: yup, I do have and use Claude Code. Color me both impressed and horrified by the amount of "working" but unmaintainable mess it can spout. Everybody who understands something about software maintenance should be horrified.

I also managed to find a 1000-line .cpp file in one of the projects. The article's content doesn't match his apps' quality. They don't bring any value. His clock looks completely AI-generated.
Remember you're grinding your anti-LLM axe against something a real person made, and that person read your comment.
> One thing I’ve noticed is that different people get wildly different results with LLMs, so I suspect there’s some element of how you’re talking to them that affects the results.

It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that others don't.

In my experience the differences are mostly in how the code produced by the LLM is reviewed. Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding. And those who rarely or never reviewed code from other developers are invariably going to miss stuff and rate the output they get higher.

In the plethora of articles that explain the process of building projects with LLMs, one thing I've never understood is why the authors seem to write the prompts as if talking to a human who cares how good their grammar or syntax is, e.g.:

> I'd like to add email support to this bot. Let's think through how we would do this.

and I'm not even talking about the usage of "please" or "thanks" (which this particular author doesn't seem to be doing).

Is there any evidence that suggests the models do a better job if I write my prompt like this instead of "wanna add email support, think how to do this"? In my personal experience (mostly with Junie) I haven't seen any advantage of being "polite", for lack of a better word, and I feel like I'm saving on seconds and tokens :)

I like reading these types of breakdowns. Really gives you ideas and insight into how others are approaching development with agents. I'm surprised the author hasn't broken down the developer agent persona into smaller subagents. There is a lot of context used when your agent needs to write in a larger breadth of code areas (i.e. database queries, tests, business logic, infrastructure, the general code skeleton). I've also read[1] that having a researcher and then a planner helps with context management in the pre-dev stage as well. I like his use of multiple reviewers, and am similarly surprised that they aren't refined into specialized roles.

I'll admit to being a "one prompt to rule them all" developer, and will not let a chat go longer than the first input I give. If mistakes are made, I fix the system prompt or the input prompt and try again. And I make sure the work is broken down as much as possible. That means taking the time to do some discovery before I hit send.

Is anyone else using many smaller specific agents? What types of patterns are you employing? TIA

1. https://github.com/humanlayer/advanced-context-engineering-f...

That reference you give is pretty dated now, based on a talk from August, which is the Beforetimes of the newer models that have given such a step change in productivity.

The key change I've found is really around orchestration - as TFA says, you don't run the prompt yourself. The orchestrator runs the whole thing. It gets you to talk to the architect/planner, then the output of that plan is sent to another agent, automatically. In his case he's using an architect, a developer, and some reviewers. I've been using a Superpowers-based [0] orchestration system, which runs a brainstorm, then a design plan, then an implementation plan, then some devs, then some reviewers, and loops back to the implementation plan to check progress and correctness.

It's actually fun. I've been coding for 40+ years now, and I'm enjoying this :)

[0] https://github.com/obra/superpowers

Can you bolt superpowers onto an existing project so that it uses the approach going forward (I'm using Opencode), or would that get too messy?
Yes. But gsd is even better - especially gsd2
re: breaking into specialized subagents -- yes, it matters significantly but the splitting criteria isn't obvious at first.

what we found: split on domain of side effects, not on task complexity. a "researcher" agent that only reads and a "writer" agent that only publishes can share context freely because only one of them has irreversible actions. mixing read + write in one agent makes restart-safety much harder to reason about.

the other practical thing: separate agents with separate context windows helps a lot when you have parts of the graph that are genuinely parallel. a single large agent serializes work it could parallelize, and the latency compounds across the whole pipeline.
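A minimal sketch of that split, with all names being illustrative assumptions rather than anything from the comment: the researcher only reads, so it can be restarted or retried freely, while the writer is the single place irreversible side effects happen.

```python
# Splitting agents on the domain of side effects: read-only vs. publishing.
from dataclasses import dataclass, field


@dataclass
class Researcher:
    """Read-only agent: no side effects, so retries and restarts are safe."""
    notes: list[str] = field(default_factory=list)

    def read(self, source: str) -> str:
        # In a real system this would call an LLM or a search tool.
        finding = f"summary of {source}"
        self.notes.append(finding)
        return finding


@dataclass
class Writer:
    """Owns all irreversible actions; nothing else is allowed to publish."""
    published: list[str] = field(default_factory=list)

    def publish(self, content: str) -> None:
        # The only place shared state is mutated, which keeps
        # restart-safety easy to reason about.
        self.published.append(content)


def pipeline(sources: list[str]) -> list[str]:
    researcher, writer = Researcher(), Writer()
    for source in sources:
        writer.publish(researcher.read(source))
    return writer.published
```

Because the researcher never writes, several of them can run in parallel over different sources without coordinating; only the publish step needs to be serialized.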

I'm not sure the notion I keep seeing of "it's ok, we still architect, it just writes the code"(paraphrased) sits well with me.

I've not tested it with architecting a full system, but assuming it isn't good at it today... it's only a matter of time. Then what is our use?

Others have already partially answered this, but here's my 20 cents. Software development really is similar to architecture. The end result is an infrastructure of unique modules with different types of connectors (roads, grid, or APIs). Until now, in SW dev the grunt work was done mostly by the same people who did the planning, decided on the types of connectors, etc. Real estate architects also use a bunch of software tools to aid them, but there must be a human being at the end of the chain who understands human needs, who understands, after years of studying and practicing, how the whole building and the infrastructure will behave at large, and who is ultimately responsible for the end result (and hopefully rewarded depending on the complexity and quality of the end result). So yes, we will not need as many SW engineers, but those who remain will work on complex, rewarding problems and will push the frontier further.
LLMs can build anything. The real question is what is worth building, and how it’s delivered. That is what is still human. LLMs, by nature of not being human, cannot understand humans as well as other humans can. (See every attempt at using an LLM as a therapist)

In short: LLMs will eventually be able to architect software. But it’s still just a tool

> Then what is our use?

You will have to find new economic utility. That's the reality of technological progress; it's just that the tech and white-collar industries didn't think it could come for them!

A skill that becomes obsolete is useless, obviously. There's still room for artisanal/handcrafted wares today amidst industrial-scale production, so I would assume similar levels for coding.

It's interesting to see some patterns starting to emerge. Over time, I ended up with a similar workflow. Instead of using plan files within the repository, I'm using Notion as the memory and source of truth.

My "thinker" agent will ask questions, explore, and refine. It will write a feature page in Notion and split the implementation into tasks on a kanban board, for an "executor" to pick up, implement, and pass to a QA agent, which will either flag it or move it to human review.

I really love it. All of our other documentation lives in Notion, so I can easily reference and link business requirements. I also find it much easier to make sense of the steps by checking the tickets on the board rather than in a file.

Reviewing is simpler too. I can pick the ticket in the human review column, read the requirements again, check the QA comments, and then look at the code. Had a lot of fun playing with it yesterday, and I shared it here:

https://github.com/marcosloic/notion-agent-hive

I wanted to know how to make software with LLMs "without losing the benefit of knowing how the entire system works", staying "intimately familiar with each project’s architecture and inner workings", while having "never even read most of their code". (Because obviously, you can't.) But OP didn't explain that.

You tell an LLM to create something, and then use another LLM to review it. That might make the result safer, but it doesn't mean that YOU understand the architecture. No one does.

Hot take: you can't have your cake and eat it too. If you aren't writing code, designing the system, creating architecture, or even writing the prompt, then you're not understanding shit. You're playing slots with stochastic parrots

    The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
- Karpathy 2025
Haha love the Sleight of hand irregular wall clock idea. I once had a wall clock where the hand showing the seconds would sometimes jump backwards, it was extremely unsettling somehow because it was random. It really did make me question my sanity.
On using different models: GitHub copilot has an API that gives you access to many different models from many different providers. They are very transparent about how they use your data[1]; in some cases it’s safer to use a model through them than through the original provider.

You can point Claude at the copilot models with some hackery[2] and opencode supports copilot models out of the box.

Finally, Copilot is quite generous with the amount of usage you get from a GitHub Pro plan (it goes really far with Sonnet 4.6, which feels pretty close to Opus 4.5), and they're generous with their free Pro licenses for open source etc.

Despite having stuck to autocomplete as their main feature for too long, this aspect of their service is outstanding.

Big +1 for opencode which for my purposes is interchangeable or better than Claude and can even use anthropic models via my GitHub copilot pro plan. I use it and Claude when one or the other hits token limits.

Edit: a comment below reminded me why I prefer opencode: a few pages in on a Claude session and it’s scrolling through the entire conversation history on every output character. No such problem on OC.

I find the same problem applies to coding too. Even with everyone acting in good faith and reviewing everything themselves before pushing, you essentially have two reviewers instead of a writer and a reviewer, and there is no etiquette yet mandating how thoroughly the "author" should review their own PR. It doesn't help that the amount of code to review gets larger (why would you go into agentic coding otherwise?)
We build and run a multi-agent system. Today Cursor won. For a log analysis task — Cursor: 5 minutes. Our pipeline: 30 minutes.

Still a case for it: 1. Isolated contexts per role (CS vs. engineering) — agents don't bleed into each other 2. Hard permission boundaries per agent 3. Local models (Qwen) for cheap routine tasks

Multi-agent loses at debugging. But the structure has value.

When I use Claude code to work on a hobby project it feels like doom scrolling…

I can’t get my head around if the hobby is the making or the having, but fair to say I’ve felt quite dissatisfied at the end of my hobby sessions lately so leaning towards the former.

This is similar to how I use LLMs (architect/plan -> implement -> debug/review), but after getting bit a few times, I have a few extra things in my process:

The main difference between my workflow and the author's is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.

This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.

Before the current round of models, I would religiously clear context and rely on these files for truth, but even with the newest models/agentic harnesses, I find it helps avoid regressions as the software evolves over time.

A minor difference between myself and the author, is that I don't rely on specific sub-agents (beyond what the agentic harness has built-in for e.g. file exploration).

I say it's minor, because in practice the actual calls to the LLMs undoubtedly look quite similar (clean context window, different task/model, etc.).

One tip, if you have access, is to do the initial design/architecture with GPT-5.x Pro, and then take the output "spec" from that chat/iteration to kick-off a codex/claude code session. This can also be helpful for hard to reason about bugs, but I've only done that a handful of times at this point (i.e. funky dynamic SVG-based animation snafu).
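As a rough sketch of the "anchoring" idea (the file layout and helper name are my assumptions, not the commenter's actual tooling), each pipeline stage could land in its own timestamped markdown file:

```python
# Persist each stage (design, plan, debug, ...) as a timestamped
# markdown file so decisions survive context clears and can be
# referenced later, instead of living only in the chat window.
from datetime import datetime, timezone
from pathlib import Path


def write_plan(docs_dir: Path, stage: str, body: str) -> Path:
    """Write one pipeline stage to a timestamped markdown file."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    path = docs_dir / f"{stamp}-{stage}.md"
    docs_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(f"# {stage}\n\n{body}\n")
    return path
```

The timestamps give you a cheap, greppable audit trail of how the design evolved, which is exactly what a context window does not give you.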

> The main difference between my workflow and the authors, is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.
>
> This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.

Would you please expand on this? Do you make the LLM append their responses to a Markdown file, prefixed by their timestamps, basically preserving the whole context in a file? Or do you make the LLM update some reference files in order to keep a "condensed" context? Thank you.

I know the argument I'm going to make is not original, but with every passing week it's becoming more obvious that if the productivity claims were even half true, those "1000x" LLM shamans would have toppled the economy by now. Where are the slop-coded billion-dollar IPOs? We should have one every other week.
Great article. I'd recommend making guardrails and benchmarking an integral part of prompt engineering. Think of it as kind of a system prompt to your Opus 4.6 architect: LangChain, RAG, LLM-as-a-judge, MCP. When I think about benchmarks, I always ask it to research external DBs or other resources as a referencing guardrail.
I write very little code these days, so I've been following the AI development mostly from the backseat. One aspect I fail to grasp perfectly is what the practical differences are between CLI (so terminal-based) agents and ones fully integrated into an IDE.

Could someone chime in and give their opinion on what are the pros and cons of either approach?

For me, I use an IDE if I plan to look at the code.
Hi, does anyone have a simple example/scaffold of how to set up agents/skills like this? I've looked at the stavrobot repo and only saw an AGENTS.md. Where do these skills live, then?

(I have seen obra/superpowers mentioned in the comments, but that's already too complex and has a UI focus)

These skills live in my home directory, that's why they aren't in the repos. I can upload them if you want.
Not original commenter, but would be curious (and thankful) to see it.
I played with this over the weekend:

https://github.com/marcosloic/notion-agent-hive

Ultimately, it's just a bunch of markdown files that live in an `/agents` folder, with some meta-information that will depend on the harness you use.

Agent bots are the new “TODO” list apps. Seems cool and all, but I wish I could see someone writing useful software with LLMs, at least once.

So much power in our hands, and soon another Facebook will appear built entirely by LLMs. What a fucking waste of time and money.

It’s getting tiring.

I am enjoying the RePPIT framework from Mihail Eric. I think it's a better formalization of developing without resorting to personas.
> Before that, code would quickly devolve into unmaintainability after two or three days of programming, but now I’ve been working on a few projects for weeks non-stop, growing to tens of thousands of useful lines of code, with each change being as reliable as the first one.

I'm glad it works for the author, I just don't believe that "each change being as reliable as the first one" is true.

> I no longer need to know how to write code correctly at all, but it’s now massively more important to understand how to architect a system correctly, and how to make the right choices to make something usable.

I agree that knowing the syntax is less important now, but I don't see how the latter claim has changed with the advent of LLMs at all?

> On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet, even at tens of thousands of SLoC. Most of that must be because the models are getting better, but I think that a lot of it is also because I’ve improved my way of working with the models.

I think the author is contradicting himself here. Programs written by an LLM in a domain he is not knowledgable about are a mess. Programs written by an LLM in a domain he is knowledgeable about are not a mess. He claims the latter is mostly true because LLMs are so good???

My take after spending ~2 weeks working with Claude full time writing Rust:

- Very good for language level concepts: syntax, how features work, how features compose, what the limitations are, correcting my wrong usage of all of the above, educating me on these things

- Very good as an assistant to talk things through, point out gaps in the design, suggest different ways to architect a solution, suggest libraries etc.

- Good at generating code, that looks great at the first glance, but has many unexplained assumptions and gaps

- Despite lack of access to the compiler (Opus 4.6 via Web), most of the time code compiles or there are trivially fixable issues before it gets to compile

- Has a hard-to-explain fixation on doing things a certain way, e.g. it always wants to use panics on errors (panic!, unreachable!, .expect, etc.) or wants to do type erasure with Box<dyn Any>, as if that were the most idiomatic and desirable way of doing things

- I ended up getting some stuff done, but it was very frustrating and intellectually draining

- The only way I see to get things done to a good standard is to continuously push the model to go deeper and deeper regarding very specific things. "Get x done" and variations of that idea will inevitably lead to stuff that looks nice, but doesn't work.

So... imo it is a new generation compiler + code gen tool, that understands human language. It's pretty great and at the same time it tires me in ways I find hard to explain. If professional programming going forward would mean just talking to a model all day every day, I probably would look for other career options.

What's the point of writing this? In a few weeks a new model will come out and make your current work pattern obsolete (a process described in the post itself).
Ah, another one of these. I'm eager to learn how a "social climber" talks to a chatbot. I'm sure it's full of novel insight, unlike thousands of other articles like this one.
This was on the front page and then got completely buried for some reason. Super weird.
On the front page at the moment. Position 12
Maybe I missed it. Sometimes when you're scanning for something your brain intentionally doesn't want to see it, I've noticed. Anyway I'm not Stavros obviously, just thought this was a good article.
TL;DR: Don't, please :)