I assume that until LLMs are 100% better than humans in all cases, and as long as I have to be in the loop, there will be a pretty hard upper bound on what I can do, and it seems like we’ve roughly hit that limit.
Funny enough, I get this feeling with a lot of modern technology. iPhones, all the modern messaging apps, etc. make it much too easy to fragment your attention across a million different things. It’s draining. Much more draining than the old days.
I do as well, so I totally know what you're talking about. There's part of me that thinks it will become less exhausting with time and practice.
In high school and college I worked at this Italian place that did dine-in, to-go, and delivery orders. I got hired as a delivery driver and loved it. A couple years in there was a spell where they had really high turnover, so the owners asked me to be a waiter for a little while. The first couple months I found the small talk and the need to always be "on" absolutely exhausting, but over time I found my routine and it became less exhausting. I definitely loved being a delivery driver far more, but eventually I did hit a point where I didn't feel completely drained after every shift of waiting tables.
I can't help but think coding with LLMs will follow a similar pattern. I don't think I'll ever like it more than writing the code myself, but I have to believe at some point I'll have done it enough that it doesn't feel completely draining.
With the rise of open source, there started to be more black-box composition: you grabbed some big libraries like Django or NumPy and honestly just hoped there weren't any bugs, but if there were, you could plausibly step through the debugger, figure out what was going wrong, and file a bug report.
Now the LLMs are generating so many orders of magnitude more code than any human could ever have the chance to debug that you're basically just firing this stuff out like a firehose on a house fire, giving it as much control as you can muster but really just trusting the raw power of the thing to get the job done. And, bafflingly, it works pretty well, except in those cases where it doesn't, so you can't stop using the tool but you can't really ever get comfortable with it either.
Not just that: with programming languages you can have the utmost precision to describe _how_ the problem needs to be solved _and_ you can have some degree of certainty that your directions (code) will be followed accurately.
It’s maddening to go from that to using natural language which is interpreted by a non-deterministic entity. And then having to endlessly iterate on the results with some variation of “no, do it better” or, even worse, some clever “pattern” of directing multiple agents to check each other’s work, which you’ll have to check as well eventually.
So as a human, you make the judgement that the cases where it works well enough more than make up for the mistakes. Comfort is a mental state, and the discomfort can be easily defeated by separating your own identity and ego from the output you create.
The code part is trivial and a waste of time in some ways compared to time spent making decisions about what to build. And sometimes it is even a form of procrastination to avoid thinking about what to build, like people who polish their game engine (easy) to avoid putting in the work to plan a fun game (hard).
The more clarity you have about what you’re building, the larger the blocks of work you can delegate / outsource.
So I think one overwhelming part of LLMs is that you don’t get the downtime of working on implementation since that’s now trivial; you are stuck doing the hard part of steering and planning. But that’s also a good thing.
The whole time I'm doing it, I'm trying to think of better ways. I'm thinking of libraries, utilities or even frameworks I could create to reduce the tedium.
This is actually one of the things I dislike the most about LLM coding: they have no problem with tedium and will happily generate tens of thousands of lines where a much better approach could exist.
I think it's an innovation killer. Would any of the ORMs or frameworks we have today exist if we'd had LLMs this whole time?
I doubt it.
I've written it up here, including the transcript of an actual real session:
https://www.stavros.io/posts/how-i-write-software-with-llms/
I just woke up recently myself and found out these tools were actually becoming really, really good. I use a similar prompt system, but with not as much focus on review - I've found the review bots to be really good already, but it is more efficient to work locally.
One question I have, since you mention using lots of different models: do you ever have to tweak prompts for a specific model, or are these things pretty universal?
And when you make the decisions, it is you who is responsible for them. Whereas if you just do the coding, the decisions about the code are left largely to you and nobody much sees them, only how they affect the outcome. Now the LLM is in that role, responsible only for what the code does, not how it does it.
LLMs will do pretty much exactly what you tell them, and if you don't tell them something they'll make up something based on what they've been trained to do. If you have rules for what good code looks like, and those are a higher bar than 'just what's in the training data', then you need to build a clear context and write an unambiguous prompt that gets you what you want. That's a lot of one-time work to build a good agent or skill, but then the output will be much better.
The result is that I could say that it was code that I myself approved of. I can't imagine a time when I wouldn't read all of it; when you just let them go, the results are so awful. If you're letting them go and reviewing at the end, like a post-programming review phase, I don't even know if that's a skill that can be mastered while the LLMs are still this bad. Can you really master Where's Waldo? Everything's a mess, but you're just looking for the part of the mess that has the bug?
I'm not reviewing after I ask it to write some entire thing. I'm getting it to accomplish a minimal function, then layering features on top. If I don't understand where something is happening, or I see it's happening in too many places, I have to read the code in order to tell it how to refactor the code. I might have to write stubs in order to show it what I want to happen. The reading happens as the programming is happening.
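To make the stub idea concrete, here is a minimal Python sketch of what "writing stubs to show it what I want" might look like. Everything in it is hypothetical and for illustration only: the `Order` type, the function names, and the split between I/O and pure logic are assumptions, not taken from any real project.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Order:
    """Hypothetical domain type, used only for illustration."""
    order_id: int
    total_cents: int


def load_orders(path: str) -> list[Order]:
    """Stub: read orders from a single file.

    Intent for the LLM: all file I/O lives here and nowhere else.
    """
    raise NotImplementedError


def total_revenue(orders: list[Order]) -> int:
    """Stub: pure function with no I/O; should return the sum of total_cents."""
    raise NotImplementedError
```

The signatures and docstrings carry the design decision (where the logic should live), and the bodies are left for the model to fill in, one small function at a time.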