Focusing on flashy breakthroughs hides the real issue: bigger models and higher merge-benchmark scores rarely translate into reliability in real codebases. For routine merges, subtle regressions and context quirks matter more than headline progress. Unless evals stress nasty scenarios like multi-file renames with tricky conflicts, the numbers are mostly for show. Progress will plateau until someone tunes for the boring, messy cases that actually waste dev time.