Hacker News new | past | comments | ask | show | jobs | submit
I assume it's a lack of care when RLing them.

RL has a tendency to reinforce cheating when the cheats are easier to find than the final solution.

So when making your RL environment, you need to spend a lot of effort on finding ways the model can cheat and penalizing them.