I'm not one to judge AI on pratfalls, and cyphers are a somewhat adversarial task. However, there was no aspect of the reasoning that seemed more advanced or consistent than previous chain-of-thought demos I've seen. So the main proof point we have is the paper, and I'm not sure how I'd go from there to being able to trust this on the kind of task it is intended for. Do others have patterns by which they get utility from chain of thought engines?
Separately, chain of thought outputs really make me long for tool use, because the LLM is often forced to simulate algorithmic outputs. It feels like a commercial chain-of-thought solution like this should have a standard library of functions it can use for 100% reliability on things like letter counts.
Impressive but the problem with RL is that it requires knowledge of the future.