It's constantly turning what should be a 50 LOC patch from a single prompt into a 30-minute exploration that is totally not worth it. Often it's wrong, even.
I trialed it on some rather simple stuff: backfilling a Redis dedupe cache after the hash function changed. Instead of running the new hash function on every DB value to expand the cache, it implemented an overly complex cache update that tried to guess the hash function version of each cached value and recalculate only the old hashes. I can imagine this making sense in some context, maybe? But it wasn't worth 30 minutes of token burn that I ended up replacing with a 10-line for loop.
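For reference, the simple version is roughly this shape. This is a minimal sketch, not the actual code: `new_hash` and `backfill_dedupe_cache` are hypothetical names, SHA-256 is just a placeholder for whatever the new hash function is, and a plain Python set stands in for the Redis set (with redis-py you would call something like `r.sadd(key, digest)` instead of `cache.add(digest)`).

```python
import hashlib
from typing import Iterable, Set


def new_hash(value: str) -> str:
    """Stand-in for the new hash function (hypothetical; SHA-256 here)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def backfill_dedupe_cache(db_values: Iterable[str], cache: Set[str]) -> int:
    """Add the new-style hash of every DB value to the dedupe cache.

    Old hashes are left in place; they simply stop matching anything.
    Returns the number of hashes that were newly added.
    """
    added = 0
    for value in db_values:
        digest = new_hash(value)
        if digest not in cache:
            cache.add(digest)
            added += 1
    return added
```

No guessing which hash version a cached entry came from: every value gets rehashed, and the loop is safe to rerun because entries already present are skipped.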
I fear that this is generally bad news for programming. LLM tech is clearly running into a wall of diminishing returns on intelligence, and the response is to just make the models more relentless, which is a pretty poor solution for everyone involved, except I guess the people who sell the tokens and the people who can afford those tokens to scan for 0-days.
I see two problems with LLMs & agents which won't be fixed, possibly ever.
1) They don't have causal models. All they can do is trial-and-error exploration, which works quite well for many problems. But many other problems require a causal model.
2) Prompts lack precision, and programming languages and machine models were invented to solve this problem. English is great, but it is not a programming language.
They’ve been doing a lot of strategic introduction and manipulation in the run-up to the IPO, and it’s worked in that regard.
It was actually pretty maddening, as what should have taken a minute or two, tops, took like 10 because it went down this route.
I'm gonna try something much more complex later, but for simple things, it felt like driving a Corvette to the mailbox.