Story Detail of id 41523718 | Liveview Hacker News

evrydayhustling 4 months ago | on: Learning to Reason with LLMs

Just did some preliminary testing on decrypting some ROT cyphertext which would have been viable for a human on paper. The output was pretty disappointing: lots of "workish" steps creating letter counts, identifying common words, etc., but many steps were incorrect or not followed up on. In the end, it claimed to check its work and delivered an incorrect solution that did not satisfy the previous steps.

I'm not one to judge AI on pratfalls, and cyphers are a somewhat adversarial task. However, there was no aspect of the reasoning that seemed more advanced or consistent than previous chain-of-thought demos I've seen. So the main proof point we have is the paper, and I'm not sure how I'd go from there to being able to trust this on the kind of task it is intended for. Do others have patterns by which they get utility from chain of thought engines?

Separately, chain of thought outputs really make me long for tool use, because the LLM is often forced to simulate algorithmic outputs. It feels like a commercial chain-of-thought solution like this should have a standard library of functions it can use for 100% reliability on things like letter counts.
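As a rough sketch of what such a "standard library" could look like for this particular task, here are deterministic helpers for letter counts and rotation ciphers. The function names (`letter_counts`, `rot`, `all_rotations`) are hypothetical and not part of any actual model tool API; the point is only that these steps are trivial to make exact in code rather than simulated in text.

```python
from collections import Counter
import string


def letter_counts(text):
    """Count ASCII letters case-insensitively, ignoring everything else."""
    return Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)


def rot(text, shift):
    """Rotate ASCII letters by `shift` positions; leave other characters alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def all_rotations(ciphertext):
    """Map each of the 26 possible shifts to the resulting candidate plaintext."""
    return {shift: rot(ciphertext, shift) for shift in range(26)}


if __name__ == "__main__":
    # Brute-force view: print every candidate and pick the readable one.
    for shift, candidate in all_rotations("Uryyb, Jbeyq!").items():
        print(shift, candidate)
```

With only 26 candidates, a model (or a person) just has to recognize the readable line instead of getting every intermediate count right by hand.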

changoplatanero 4 months ago | parent | next

Hmm, are you sure it was using the o1 model and not GPT-4o? I've been using the o1 model and it does consistently well at solving rotation ciphers.

evrydayhustling 4 months ago | root | parent

o1-preview. Were you using common plaintexts by chance (e.g. proverbs) or ROT13 specifically? Mine use all the right steps but just can't string them together.

charlescurt123 4 months ago | parent | next

It's RL so that means it's going to be great on tasks they created for training but not so much on others.

Impressive but the problem with RL is that it requires knowledge of the future.

mewpmewp2 4 months ago | parent

Out of curiosity, can you try the same thing with Claude? When I tried Claude with any sort of ROT, it had amazing performance compared to GPT.