With Opus I can work, trust its designs, architecture suggestions, and code changes, even in a complex code base.
The smaller models seem to "try". They work for smaller tasks, but for more complex tasks it's often more work than doing it myself.
I wish it were different, and maybe in a year or two it will be.
always has been
claude code has opusplan — uses opus while in plan mode, switches to sonnet for execution.
https://code.claude.com/docs/en/model-config#opusplan-model-...
edit: you can make it work with sonnet for planning and haiku for execution, or any other combination you fancy.
https://code.claude.com/docs/en/model-config#control-the-mod...
1. Step execution (Sonnet): Work for 30 minutes / 100k tokens at the direction of the Orchestrator
2. Review (Opus): Scrutinize the previous step's work for errors and fidelity to the instructions, fix what you find, and record opportunities to improve the agent configuration + tools to reduce errors and token usage (write those to a file).
3. Self-improvement (Opus): Implement the highest impact self-improvement items that don't require user intervention.
Repeat: Until the orchestrator's session token budget is exhausted (set it to 1M or whatever).
The underlying rationale is to keep each step manageable to maximize adherence to instructions and minimize cost (even cached tokens cost something). Prompt tokens are much cheaper than generated ones, so to the extent that Opus mostly reviews rather than drives, that saves a lot too. Self-improvement steps are very expensive but the improvements compound; if you're going to run a job for days or weeks, it's way more expensive not to do them.
Edit: I do this in Claude Code with the Anthropic models as well as Qwen family models for offline use.
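For anyone who wants the shape of that loop in code, here's a rough sketch of the step / review / self-improve cycle with a session token budget. Everything in it is made up for illustration: `Orchestrator`, `call_model`, the role names and the prompt strings are hypothetical, and `call_model` stands in for whatever actually talks to the model. It is not a Claude Code API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical signature: (role, prompt) -> (response_text, tokens_consumed).
ModelCall = Callable[[str, str], Tuple[str, int]]


@dataclass
class Orchestrator:
    call_model: ModelCall
    session_budget: int = 1_000_000  # total tokens allowed for the session
    step_budget: int = 100_000       # cap handed to each execution step
    tokens_used: int = 0
    improvement_log: List[str] = field(default_factory=list)

    def _spend(self, role: str, prompt: str) -> str:
        """Call the model and charge its token usage to the session."""
        text, tokens = self.call_model(role, prompt)
        self.tokens_used += tokens
        return text

    def run(self, task: str) -> int:
        """Repeat execute -> review -> self-improve until the budget runs low.

        Returns the number of completed cycles.
        """
        cycles = 0
        # Only start a cycle if a full execution step still fits in the budget.
        while self.tokens_used + self.step_budget <= self.session_budget:
            # 1. Step execution (the cheaper model does the work).
            work = self._spend(
                "executor", f"Work on: {task} (max {self.step_budget} tokens)"
            )
            # 2. Review (the stronger model checks, fixes, records improvements).
            review = self._spend("reviewer", f"Review and fix: {work}")
            self.improvement_log.append(review)
            # 3. Self-improvement (the stronger model applies the top item).
            self._spend("reviewer", f"Implement top improvement from: {review}")
            cycles += 1
        return cycles
```

The budget check only guards the start of a cycle, so the review and self-improvement calls can push usage a bit past the point where a new step would be refused. A real setup would probably also want the 30 minute wall-clock limit, which is left out here.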
I also work on a consumer AI application https://apps.apple.com/us/app/slidebits-studio/id1138731130
For comparison someone showed me an internal company tool he was working on. He had Claude agents dangerously skipping permissions and firing up github branches through a vm sandbox just to make a single feature change. One agent to code and the other to review.
For example you probably don't have days where you ask Opus to review your whole code base and look for code duplication/technical debt/robustness issues, and then to fix some of the found issues, and do this 3-5 times until no big issues are found anymore.
Perform a thorough analysis of the <project_name> project (the code and the documentation).
- Explore the project, go over all important files one by one and look for any mistakes or possible bugs.
- Look for refactoring opportunities and ways to improve code quality and organization.
- Identify any potential cruft/bloat, to ensure our code is clean and logically laid out. Keep in mind that efficient and good quality code needs to avoid over-engineered constructs and needless complexity. Avoid complicated logic where simple solutions would be more elegant.
- Pay attention to comments: There should be enough of them to document the intent and provide high-level overview of the code logic, but not too much; avoid/remove excessive comments that simply restate the code logic or do not provide any useful information.
- Every important function should have a top-level docstring comment that clearly explains its purpose, high-level logic overview, arguments, and return values.
- Analyze the names of constants/variables/functions/classes and other code elements: could some of them be renamed to make their purpose more clear?
- Analyze the documentation, uncover any potential inaccuracies/omissions and ensure the docs reflect the code.
- Brainstorm ideas for improvements of the code and docs.
After you finish the analysis, save an analysis report into "<project_name>_analysis_report.md" in the project root folder.
I use quite plain prompts, nothing fancy:
> go over the tests and do a code review, focusing on how well they test inventory management, planner and controller. maybe some tests need to be deleted, maybe other tests need to be added. the end goal should be good coverage of the core features.
> do a code review, focusing on robustness/correctness issues. validate that the code correctly implements specification.md. focus on the async client.
> there was a big refactor. please do a code review, focusing on eliminating tech debt. look for unused, obsolete or duplicate code that can be removed, look for mismatched interfaces, inconsistent function/argument/variable names. do not output what is correct, just the issues you found. for each issue output instructions for a coding agent on how to fix it. do not nitpick.
As we build a better and better harness and better feedback/verifiers we're switching more to 3.5 flash. I think Chinese models would work too, but we can't use those atm.
Generally there's a coordinator running opus and an ever-growing set of skills and subagents that take actions using weaker models and output feedback to the coordinator opus.
I'm pretty convinced at this point we're past the level of intelligence needed for most tasks most devs do and that will trend down as we better build harnesses for our own codebases.
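A minimal sketch of the "weaker model acts, verifier gives feedback" idea, assuming a verifier that returns pass/fail plus a feedback string. The function and parameter names (`run_with_escalation`, `attempt`, `verify`) are hypothetical, and falling back to a stronger model after repeated failures is my own assumption, not something described above.

```python
from typing import Callable, Optional, Sequence, Tuple


def run_with_escalation(
    task: str,
    models: Sequence[str],                    # ordered cheapest -> strongest
    attempt: Callable[[str, str, str], str],  # (model, task, feedback) -> result
    verify: Callable[[str], Tuple[bool, str]],  # result -> (ok, feedback)
    tries_per_model: int = 2,
) -> Optional[Tuple[str, str]]:
    """Try cheaper models first, feeding verifier feedback into each retry.

    Returns (model, result) for the first verified result, or None if
    every model exhausts its tries.
    """
    feedback = ""
    for model in models:
        for _ in range(tries_per_model):
            result = attempt(model, task, feedback)
            ok, feedback = verify(result)
            if ok:
                return model, result
    return None
```

The better the verifier, the more often the first (cheapest) model in the list gets through, which is the whole point of investing in the harness.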
...but I spend so much more time correcting it, or building pipelines to try, retry, and converge, that it's rarely worthwhile for me in either time or $ spent vs Opus.
So by using Opus you are using "smaller" model. Well, not really smaller, just worse. The actual smaller models can at least be faster.