Right, and that's why it's only part of the job. The benchmarks they're currently running consist of handing the AI a detailed spec plus tests to make pass, which isn't really what developing a feature looks like.
Going from a fuzzy, under-defined spec to something well defined isn't solved.
Going from a well-defined spec to verification criteria isn't either.
Once those are in place though, we get https://vinext.io - which, from what I understand, was largely vibe-coded using NextJS's test suite.
> First one that comes to mind is that 100% code coverage in tests means that software is perfect
I agree... but I'm also not sure software needs to be perfect.
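
To make the coverage point concrete, here's a minimal hypothetical sketch (TypeScript, Jest-style test; the function and test names are made up): the single test executes every line, so a coverage tool reports 100%, yet the empty-array case still slips through.

```ts
// Hypothetical example: every line is executed by the test below,
// so line coverage reports 100%, but the function still has a bug.
export function average(xs: number[]): number {
  let sum = 0;
  for (const x of xs) {
    sum += x;
  }
  return sum / xs.length; // bug: returns NaN for an empty array
}

// Jest-style test: exercises every line, never tries the empty array.
test("average of [2, 4] is 3", () => {
  expect(average([2, 4])).toBe(3);
});
```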