Story Detail of id 47349741 | Liveview Hacker News

ryanackley 11 hours ago | on: Are LLM merge rates not getting better?

I agree completely. I haven't noticed much improvement in coding ability in the last year. I'm using frontier models.

The game changer has been tools like Claude Code: automatic agentic tool loops purpose-built for coding. That is what I have seen as the impetus for mainstream adoption, rather than noticeable improvements in ability.

sho_hn 10 hours ago | parent | next

My anecdotal experience is rather different.

I write a lot of C++ and QML code. Codex 5.3, only released in Feb, is the first model I've used that would regularly generate code that passes my 25-year expert smell test, and it has turned generative coding from a time sap/nuisance into a tool I can somewhat rely on not to set me back.

Claude still wasn't quite there at the time, but I haven't tried 4.6 yet.

QML is a declarative-first markup language that is a superset of the JavaScript syntax. It's niche and doesn't have a giant amount of training data in the corpus. Codex 5.3 is the first model that doesn't super botch it or prefer to write reams of procedural JS embeds (yes, after steering). The tendency to go overboard on spamming everything with clouds of helper functions/methods, in both C++ and QML, is also much reduced. It knows when to stop, so to speak, and is either trained or able to reason toward a more idiomatic ideal, with far less explicit instruction / AGENTS.md wrangling.

It's a huge difference. It might be the result of very specific optimization, or perhaps simultaneous advancements in the harness play a bigger role, but in my book, my neck of the woods (or place on the long tail) only really came online in 2026 as far as LLMs are concerned.

mavamaarten 11 hours ago | parent

Maybe n=1, but I disagree? I notice that Sonnet 4.6 follows instructions much better than 4.5 and it generates code much closer to our already in-place production code.

It's just a point release and it isn't a significant upgrade in terms of features or capabilities, but it works... better for me.
