Qwen 3.7 Max: > During my local testing before the full eval harness it was the only non-GPT model that was able to complete the task, was not able to reproduce in the longer runs.

Doesn't that sound like maybe the harness was the problem?

I was using the same harness for each run. The difference is that the earlier result came from running the harness locally on my machine, before I pushed up the full runs.