Which LLM did you use? I assume that will make a pretty big difference.
gpt-5-mini and gpt-5.5 (had to tweak the code a bit to make it work)
Surprisingly, not as big a difference as one would hope. It turns out that smarter models are more conservative: a smarter model or more thinking sometimes means slightly worse recall.
I think that perhaps says more about the benchmark itself. Reviews are highly opinionated, so it could be that the smarter models are actually better and the “golden” state is just very opinionated.
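To make the recall point concrete, here is a rough sketch of how a more conservative reviewer can score lower on recall against a fixed golden set while flagging less noise. This is not the benchmark's actual scoring code: the `precision_recall` helper and the issue IDs are made up for illustration, and it assumes findings can be matched to golden comments exactly.

```python
def precision_recall(flagged, golden):
    """Return (precision, recall) of flagged review comments against a golden set."""
    flagged, golden = set(flagged), set(golden)
    hits = len(flagged & golden)  # findings that also appear in the golden set
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(golden) if golden else 0.0
    return precision, recall


# Hypothetical issue IDs, for illustration only (not benchmark data).
golden = {"A", "B", "C", "D"}
eager = {"A", "B", "C", "X", "Y", "Z"}  # flags a lot, including noise
conservative = {"A", "B"}               # flags only what it is sure about

print(precision_recall(eager, golden))         # (0.5, 0.75)
print(precision_recall(conservative, golden))  # (1.0, 0.5)
```

In this toy case the conservative reviewer loses on recall even though everything it flags is correct, and anything it flags that the golden set happens to omit would count against it regardless of whether it is a real issue.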