It would be nice if you could test the model with different harnesses: Z.ai's own Z Code, Claude Code, Open Code, Pi, Cursor, etc.
My impression is that the choice of harness matters a lot.
Interesting idea. The metric I'd intuitively want to see is low variance between harnesses for a smarter model. But if a large sample of models scored consistently better with a certain harness, that would indeed be a valuable signal for developers.
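
To make that concrete, here is a rough sketch of how both numbers could be computed. The model names, harness names, and scores are all made up for illustration, and the helper functions are hypothetical; a real comparison would also need enough runs per cell to say anything statistical.

```python
from statistics import mean, pstdev

# Hypothetical pass rates (fraction of tasks solved). Not real benchmark results.
scores = {
    "model_a": {"harness_x": 0.62, "harness_y": 0.60, "harness_z": 0.61},
    "model_b": {"harness_x": 0.55, "harness_y": 0.35, "harness_z": 0.45},
}


def harness_spread(per_harness):
    """Population standard deviation of one model's scores across harnesses."""
    return pstdev(list(per_harness.values()))


def harness_means(all_scores):
    """Mean score per harness, averaged over every model that ran on it."""
    harnesses = sorted({h for per in all_scores.values() for h in per})
    return {
        h: mean(per[h] for per in all_scores.values() if h in per)
        for h in harnesses
    }


# Lower spread means the model is less sensitive to the choice of harness.
spread = {model: harness_spread(per) for model, per in scores.items()}

# A harness whose mean is higher across many models is the developer-facing signal.
means = harness_means(scores)

for model, s in spread.items():
    print(f"{model}: spread across harnesses = {s:.3f}")
for harness, m in means.items():
    print(f"{harness}: mean across models = {m:.3f}")
```

With only two models and three harnesses this shows the bookkeeping, not a result; the "large sample" part is what would make the per-harness means meaningful.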