Story Detail of id 48395879 | Liveview Hacker News

ikurei 13 hours ago | on: I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

Qwen 3.7 Max: > During my local testing before the full eval harness it was the only non-GPT model that was able to complete the task, was not able to reproduce in the longer runs.

Doesn't that sound like maybe the harness was the problem?

jc4p 6 hours ago | parent

I was using the same harness for each run; the difference is from when I was running the harness locally on my machine, before I pushed up the full runs.
