I can imagine Anthropic running this experiment multiple times and picking the most impressive one. Or I could imagine this particular run costing $1000+ in tokens. Or maybe they tried a bunch of Pokemon games and it couldn't even finish some of them. Or is it just able to do this because it has an immense amount of FireRed training data, and if you were to give it an "original" Pokemon game, where it actually had to navigate novel circumstances, it would fail?
I highly doubt they focused on FireRed specifically in pretraining or posttraining. But we'll see when the ARC-AGI-3 results come out. That will measure its performance on unseen games. Based on this I expect the ARC-AGI-3 score to be SOTA.
There are many standardized evals to do this correctly and Anthropic ignored them to provide an 18-second sped-up video of a 50-hour run?
Yeah, I don't trust this until they provide a live run by a 3rd party with full reasoning traces in real time. The reason we all liked the Gemini Plays Pokemon style runs was that they were live and couldn't be faked.