haven't verified, but attributed to Askell: "I just think that... there's this idea that you're always giving the models a personality and a persona, because they are talking like people and they are trained on human data. And I think my worry has been: if you train them to be excessively corrigible and to see that as their persona, in people I think this actually has a lot of negative broader traits. As in, if you met someone and it was just like, "oh yeah, they would literally do anything," a follower — you know, if a person just tells them something and they just fully defer, they don't bother thinking about it at all — I'm just a bit worried about how that might end up generalizing, especially if models are going to be playing a more active role in the world."
https://www.anthropic.com/research/persona-selection-model
https://www.anthropic.com/research/assistant-axis
https://www.anthropic.com/research/emergent-misalignment-rew...
https://www.anthropic.com/research/emotion-concepts-function