I love that they included some unsuccessful attempts.
MCTS doesn't seem to have worked for them.
Also wild that few-shot prompting leads to worse results in reasoning models. OpenAI hinted at that as well, but it's always just a sentence or two, with no benchmarks or specific examples.