Story Detail of id 47384331 | Liveview Hacker News

supermdguy 6 hours ago | on: Tree Search Distillation for Language Models Using PPO

> One might note that MCTS uses more inference compute on a per-sample basis than GRPO: of course it performs better

This part confused me. It sounded like they were only running MCTS at train time, then using GRPO to distill the MCTS policy into the model weights. So wouldn't the model still have the same inference cost?

at2005 4 hours ago | parent

Ah, I meant that MCTS uses more inference-time compute than GRPO to produce each training sample.
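
To make the distinction concrete, here is a rough accounting sketch. It assumes cost is counted in generated tokens, that a GRPO-style rollout yields one training sample per completion, and that a tree search runs a fixed number of simulations before emitting one sample. The function names and every number are hypothetical and not taken from the post.

```python
# Illustrative accounting only: all names and numbers here are assumptions.

def grpo_tokens_per_sample(tokens_per_rollout: int) -> int:
    """One sampled completion is used as one training sample."""
    return tokens_per_rollout


def mcts_tokens_per_sample(simulations: int, tokens_per_simulation: int) -> int:
    """A search spends many simulations before emitting one training sample."""
    return simulations * tokens_per_simulation


def deployed_tokens_per_answer(tokens_per_rollout: int) -> int:
    """After distillation the model answers with a single decode, no search."""
    return tokens_per_rollout


grpo_cost = grpo_tokens_per_sample(tokens_per_rollout=512)
mcts_cost = mcts_tokens_per_sample(simulations=64, tokens_per_simulation=512)
deploy_cost = deployed_tokens_per_answer(tokens_per_rollout=512)

# Training-time cost per sample differs by the number of simulations.
print("train-time ratio (MCTS / GRPO):", mcts_cost // grpo_cost)
# Deployment cost does not depend on how the training samples were produced.
print("deploy tokens per answer:", deploy_cost)
```

Under these assumptions the extra compute shows up only while generating training data; the distilled model's cost per answer is the same either way.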
