Same here. My take is that the codebase is too large and complex for it to find the right patterns.
It does work sometimes. The smaller the task, the better.
Isn’t that fixed by having it create a plan, then you review it and say “x should do y instead”, it updates the plan, iterate then “build the plan”?