Hacker News new | past | comments | ask | show | jobs | submit
Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

The author uses different models for each role, which I get. But I run production agents on Opus daily and in my experience, if you give it good context and clear direction in a single conversation, the output is already solid. The ceremony of splitting into "architect" and "developer" feels like it gives you a sense of control and legibility, but I'm not convinced it catches errors that a single model wouldn't catch on its own with a good prompt.

This is anecdotal but just a couple days ago, with some colleagues, we conducted a little experiment to gather that evidence.

We used a hierarchy of agents to analyze a requirement, letting agents with different personas (architect, business analyst, security expert, developer, infra etc) discuss a request and distill a solution. They all had access to the source code of the project to work on.

Then we provided the very same input, including the personas' definition, straight to Claude Code, and we compared the result.

They council of agents got to a very good result, consuming about 12$, mostly using Opus 4.6.

To our surprise, going straight with a single prompt in Claude Code got to a similar good result, faster and consuming 0.3$ and mostly using Haiku.

This surely deserves more investigation, but our assumption / hypothesis so far is that coordination and communication between agents has a remarkable cost.

Should this be the case, I personally would not be surprised:

- the reason why we humans do job separation is because we have an inherent limited capacity. We cannot reach the point to be experts in all the needed fields : we just can't acquire the needed knowledge to be good architects, good business analysts, good security experts. Apparently, that's not a problem for a LLM. So, probably, job separation is not a needed pattern as it is for humans.

- Job separation has an inherent high cost and just does not scale. Notably, most of the problems in human organizations are about coordination, and the larger the organization the higher the cost for processes, to the point processed turn in bureaucracy. In IT companies, many problems are at the interface between groups, because the low-bandwidth communication and inherent ambiguity of language. I'm not surprised that a single LLM can communicate with itself way better and cheaper that a council of agents, which inevitably faces the same communication challenges of a society of people.

loading story #47402505
loading story #47396432
loading story #47396949
loading story #47400526
There's a lot of cargo culting, but it's inevitable in a situation like this where the truth is model dependent and changing the whole time and people have created companies on the premise they can teach you how to use ai well.
Its also inevitable given that we still don't even really know how these models work or what they do at inference time.

We know input/output pairs, when using a reasoning model we can see a separate stream of text that is supposedly insight into what the model is "thinking" during inference, and when using multiple agents we see what text they send to each other. That's it.

I think this is just anthropomorphism. Sub agents make sense as a context saving mechanism.

Aider did an "architect-editor" split where architect is just a "programmer" who doesn't bother about formatting the changes as diff, then a weak model converts them into diffs and they got better results with it. This is nothing like human teams though.

loading story #47402396
> what's the evidence

What’s the evidence for anything software engineers use? Tests, type checkers, syntax highlighting, IDEs, code review, pair programming, and so on.

In my experience, evidence for the efficacy of software engineering practices falls into two categories:

- the intuitions of developers, based in their experiences.

- scientific studies, which are unconvincing. Some are unconvincing because they attempt to measure the productivity of working software engineers, which is difficult; you have to rely on qualitative measures like manager evaluations or quantitative but meaningless measures like LOC or tickets closed. Others are unconvincing because they instead measure the practice against some well defined task (like a coding puzzle) that is totally unlike actual software engineering.

Evidence for this LLM pattern is the same. Some developers have an intuition it works better.

My friend, there’s tons of evidence of all that stuff you talked about in hundreds of papers on arxiv. But you dismiss it entirely in your second bullet point, so I’m not entirely sure what you expect.
loading story #47399031
You can measure customer facing defects.

Also, lines of code is not completely meaningless metric. What one should measure is lines of code that is not verified by compiler. E.g., in C++ you cannot have unbalanced brackets or use incorrectly typed value, but you still may have off-by-one error.

Given all that, you can measure customer facing defect density and compare different tools, whether they are programming languages, IDEs or LLM-supported workflow.

> Also, lines of code is not completely meaningless metric.

Comparing lines of code can be meaningful, mostly if you can keep a lot of other things constant, like coding style, developer experience, domain, tech stack. There are many style differences between LLM and human generated code, so that I expect 1000 lines of LLM code do a lot less than 1000 lines of human code, even in the exact same codebase.

{"deleted":true,"id":47396924,"parent":47395865,"time":1773654262,"type":"comment"}
The proper metric is the defect escape rate.
Now you have to count defects
loading story #47396087
Most developer intuitions are wrong.

See: OOP

Intuition is subjective. It's hard to convert subjective experience to objective facts.
loading story #47396739
The different models is a big one. In my workflow, I've got opus doing the deep thinking, and kimi doing the implementation. It helps manage costs.

Sample size of one, but I found it helps guard against the model drifting off. My different agents have different permissions. The worker can not edit the plan. The QA or planner can't modify the code. This is something I sometimes catch codex doing, modifying unrelated stuff while working.

loading story #47396771
After "fully vibecoding" (i.e. I don't read the code) a few projects, the important aspect of this isn't so much the different agents, but the development process.

Ironically, it resembles waterfall much more so than agile, in that you spec everything (tech stack, packages, open questions, etc.) up front and then pass that spec to an implementation stage. From here you either iterate, or create a PR.

Even with agile, it's similar, in that you have some high-level customer need, pass that to the dev team, and then pass their output to QA.

What's the evidence? Admittedly anecdotal, as I'm not sure of any benchmarks that test this thoroughly, but in my experience this flow helps avoid the pitfall of slop that occurs when you let the agent run wild until it's "done."

"Done" is often subjective, and you can absolutely reach a done state just with vanilla codex/claude code.

Note: I don't use a hierarchy of agents, but my process follows a similar design/plan -> implement -> debug iteration flow.

I think the splitting make sense to give more specific prompts and isolated context to different agents. The "architect" does not need to have the code style guide in its context, that actually could be misleading and contains information that drives it away from the architecture
Wouldn’t skills already solve this? A harness can start a new agent with a specific skill if it thinks that makes sense.
It's not about splitting for quality, it's about cost optimisation (Sonnet implements, which is cheaper). The quality comes with the reviewers.

Notice that I didn't split out any roles that use the same model, as I don't think it makes sense to use new roles just to use roles.

Nitpick: I don’t think architect is a good name for this role. It’s more of a technical project kickoff function: these are the things we anticipate we need to do, these are the risks etc.

I do find it different from the thinking that one does when writing code so I’m not surprised to find it useful to separate the step into different context, with different tools.

Is it useful to tell something “you are an architect?” I doubt it but I don’t have proof apart from getting reasonable results without it.

With human teams I expect every developer to learn how to do this, for their own good and to prevent bottlenecks on one person. I usually find this to be a signal of good outcomes and so I question the wisdom of biasing the LLM towards training data that originates in spaces where “architect” is a job title.

> the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

There's a 63 pages paper with mathematical proof if you really into this.

https://arxiv.org/html/2601.03220v1

My takeaway: AI learns from real-world texts, and real-world corpus are used to have a role split of architect/developer/reviewer

>> the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

> There's a 63 page paper with mathematical proof if you really into this.

> https://arxiv.org/html/2601.03220v1

I'm confused. The linked paper is not primarily a mathematics paper, and to the extent that it is, proves nothing remotely like the question that was asked.

> proves nothing remotely like the question that was asked

I am not an expert, but by my understanding, the paper prooves that a computationally bounded "observer" may fail to extract all the structure present in the model in one computation. aka you can't always one-shot perfect code.

However, arrange many pipelines of roles "observers" may gradually get you there

Perhaps this paper might be more relevant with regards to multi-agent pipelines https://arxiv.org/html/2404.04834v4
Can you explain how this paper is relevant to the comment you replied to?
Yeah always seemed pretty sus to me to.

At the same time I can see a more linear approach doing similar. Like when I ask for an implementation plan that is functional not all that different from an architect agent even if not wrapped in such a persona

In machine learning, ensembles of weaker models can outperform a single strong model because they have different distributions of errors. Machine learning models tend to have more pronounced bias in their design than LLMs though.

So to me it makes sense to have models with different architecture/data/post training refine each other's answers. I have no idea whether adding the personas would be expected to make a difference though.

One added benefit is it allows you to throw more tokens to the problem. It’s the most impactful benefit even.

Context & how LLMs work requires this.

From my experience no frontier model produces bug free & error free code with the first pass, no matter how much planning you do beforehand.

With 3 tiers, you spend your token & context budget in full in 3 phases. Plan, implement, review.

If the feature is complex, multiple round of reviews, from scratch.

It works.

“…if you give it good context…” that’s what the architect session is for basically. You throw around ideas and store the direction you want to go.

Then you execute it with a clean context.

Clean context is needed for maximum performance while not remembering implementation dead ends you already discarded

If you know what you need, my experience is that a well-formed single-prompt that fits the context gives the best results (and fastest).

If you’re exploring an idea or iterating, the roles can help break it down and understand your own requirements. Personally I do that “away” from the code though.

> Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

Using multiple agents in different roles seems like it'd guard against one model/agent going off the rails with a hallucination or something.

Even for reducing the context size it's probably worth it. If you have to go back back and forth on both problem and implementation even with these new "large" contexts if find quality degrading pretty fast.
loading story #47399670
The agent "personalities" and LLM workflow really looks like cargo-cult behavior. It looks like it should be better but we don't really have data backing this.
> produces better results than just... talking to one strong model in one session?

I think the author admits that it doesn't, doesn't realise it and just goes on:

--- start quote ---

On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet

--- end quote ---

I have been using different models for the same role - asking (say) Gemini, then, if I don't like the answer asking Claude, then telling each LLM what the other one said to see where it all ends up

Well I was until the session limit for a week kicked in.

Evidence? My friend, most of the practices in this field are promoted and adopted based on hand-waving, feelings, and anecdata from influencers.

Maybe you should write and share your own article to counter this one.

Also if something is fun, we prefer to to it that way instead of the boring way. Then it depends on how many mines you step on, after a while you try to avoid the mines. That's when your productivity goes down radically. If we see something shiny we'll happily run over the minefield again though.