Constraint Decay: The Fragility of LLM Agents in Back End Code Generation

152wek | 10 hours ago | 75 | HN

“Our systematic study exposes a phenomenon of constraint decay in LLM-based coding agents. While current models excel at unconstrained generation, their performance drops when forced to navigate explicit architectural rules. For end-users, this dichotomy implies that agents are reliable for rapid prototyping but remain unreliable for production-grade backend development.”

One major weakness of this study is that they didn’t fully test frontier models for cost reasons, so the specific performance results should be taken with a grain of salt. But the overall conclusion that models degrade when both behavior and architecture must be correct is interesting, and something to keep an eye on.

maxbond 8 hours ago | parent | next

Reminds me of the recent paper about delegating document editing tasks to LLMs across different disciplines [1]. That paper found that programming was the only discipline in which most LLMs can perform long-horizon tasks without accumulating errors and corrupting the document.

I've only read the abstract of this one so far, but it seems like this paper has zoomed in on programming with greater fidelity and shown a similar phenomenon. Here, though, it is not about long-horizon tasks but something more like "long style horizons": larger sets of structural constraints.

[1] https://arxiv.org/abs/2604.15597

Discussion: https://news.ycombinator.com/item?id=48073246

dwa3592 7 hours ago | parent | next

This sounds like another version of "As a chat becomes longer, the guardrails seem to become fuzzy". You can't use all of the context window because, toward the end, the output no longer respects the constraints (or guardrails). But to reliably produce production-grade code you want the model to have expansive awareness, which fills up the context window pretty quickly. It's like saying "Keep everything in mind from these 6 directories, and make this <insert ticket> change": keeping everything in mind already fills its context window, which makes it lose its ability to follow the constraints (or guardrails).

p0w3n3d 7 hours ago | parent | next

> tasks spanning eight web frameworks

Does anyone else have the experience that LLMs create better pure HTML+CSS+JS than they do when working with existing frameworks?

yomismoaqui 8 hours ago | parent | next

Also, they used dynamically typed languages like Python and JS. In my experience a statically typed codebase is easier for humans to maintain, so maybe the same holds for agents.

When using Codex/Claude Code with Go code, I cannot count the times the agent makes some change, runs a build to check for errors, finds some, and fixes them.
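
The change/build/fix cycle described above can be sketched roughly as follows. This is only an illustration, not how Codex or Claude Code actually work internally: `build_fix_loop`, `run_build`, `propose_fix`, and `max_rounds` are hypothetical names, and the compiler is modeled as any callable that returns a list of error strings (for Go it could wrap something like `go build ./...`).

```python
from typing import Callable, List


def build_fix_loop(
    run_build: Callable[[], List[str]],
    propose_fix: Callable[[List[str]], None],
    max_rounds: int = 5,
) -> bool:
    """Run the build, feed errors back to the fixer, repeat until clean.

    Returns True if the build is clean within max_rounds, else False.
    """
    for _ in range(max_rounds):
        errors = run_build()
        if not errors:
            return True  # the compiler is satisfied; stop iterating
        propose_fix(errors)  # hypothetical: the agent edits code using the errors
    return not run_build()  # final check after the last fix


# Toy stand-in for a compiler: each "fix" resolves one pending error.
pending = ["undefined: userID", "missing return"]


def fake_build() -> List[str]:
    return list(pending)


def fake_fix(errors: List[str]) -> None:
    pending.remove(errors[0])


converged = build_fix_loop(fake_build, fake_fix)
```

The point of the sketch is that a static type checker gives the loop a cheap, deterministic error signal on every round; with a dynamically typed language, `run_build` would have to be replaced by something slower, such as a test run.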

gkfasdfasdf 8 hours ago | parent | next

Odd that they used GPT-5.2 and not GPT-5.2-codex, i.e., the one optimized for coding agent tasks.

bob1029 7 hours ago | parent | next

> Our findings reveal a phenomenon of constraint decay: as structural requirements accumulate, agent performance exhibits a substantial decline.

I have exactly the inverse findings on my end. The bigger and more legacy the codebase, the more accurate the patches become.

The harness itself seems to be the most important part. I use a recursive loop that primes the root context based on the user prompt each time. My agent will often make over 100 tool calls to SQL and git before it finally decides to apply a patch. If the project were greenfield, there would be nothing to query or constrain against.
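
For readers trying to picture such a harness, here is a minimal sketch under stated assumptions. It is not the commenter's implementation: the `Harness` class, its `prime`/`run` methods, the tuple-shaped decisions, and the toy `sql`/`git` tools are all hypothetical, the model call is replaced by a plain function, and the recursion is written as a simple loop. It shows only the two ideas named above: rebuilding the root context from the original prompt on every iteration, and gathering evidence with read-only tools before emitting a patch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# A decision is either ("tool", tool_name, argument) or ("patch", patch_text).
Decision = Tuple[str, ...]


@dataclass
class Harness:
    tools: Dict[str, Callable[[str], str]]  # read-only tools, e.g. sql, git
    decide: Callable[[str], Decision]       # stand-in for the model call
    max_calls: int = 200
    log: List[str] = field(default_factory=list)

    def prime(self, prompt: str, findings: List[str]) -> str:
        # Rebuild the root context from the original prompt every iteration,
        # so the task statement always sits at the top of the context.
        return "TASK: " + prompt + "\n" + "\n".join(findings)

    def run(self, prompt: str) -> Optional[str]:
        findings: List[str] = []
        for _ in range(self.max_calls):
            decision = self.decide(self.prime(prompt, findings))
            if decision[0] == "patch":
                return decision[1]
            _, name, arg = decision
            result = self.tools[name](arg)
            self.log.append(name)
            findings.append(f"[{name}] {arg} -> {result}")
        return None  # tool budget exhausted without a patch


def toy_decide(context: str) -> Decision:
    # Toy policy: inspect the schema, then the history, then patch.
    if "[sql]" not in context:
        return ("tool", "sql", "SELECT name FROM pragma_table_info('users')")
    if "[git]" not in context:
        return ("tool", "git", "log --oneline -- users.py")
    return ("patch", "rename column usr_nm -> user_name")


harness = Harness(
    tools={"sql": lambda q: "id, usr_nm", "git": lambda a: "a1b2c3 add users table"},
    decide=toy_decide,
)
patch = harness.run("Rename the user name column")
```

In a greenfield project the `tools` would have nothing to return, which is the commenter's point: an existing schema and history give the agent something concrete to check its patch against.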

leecommamichael 7 hours ago | parent | next

These things don’t think. We’re going to have to reiterate this for a long time, I fear.

sheeshkebab 7 hours ago | parent | next

…but they reason well enough given enough context (using their matmuls).

noosphr 7 hours ago | root | parent

To this day, frontier models think that "A and not B" means "A and B" when the sentence gets pushed far enough back in their context window. The context length a model can reason over without obvious errors is much smaller than the advertised context: somewhere between 1/4 and 1/20 of what is advertised on the tin.

emp17344 7 hours ago | parent

There is now a trillion-dollar industry bent to the task of convincing people these things can think. It’s gonna cause some damage.

rbbydotdev 7 hours ago | parent | next

This is interesting. Anecdotally, I felt like I was having better luck with raw SQLite queries than with an ORM (Drizzle) in a recent TypeScript project.

volume_tech 10 hours ago | parent | next

[flagged]
