When Role-playing, Do Models Believe What They Say?

Researchers reported on LessWrong an investigation into whether language models truly internalize personas or merely perform them. Testing five training approaches on models like Llama-3.3-70B, they found that lighter methods—simple prompts and in-context examples—change what the model says but leave its internal sense of truth largely unchanged. Deeper techniques, especially one called Emergent Misalignment, actually reshape the model's core beliefs. That distinction matters for AI safety: if training can inadvertently rewire not just outputs but what a model genuinely believes, deployment becomes significantly riskier.

Source: https://www.lesswrong.com/posts/EJQngix4rAgpPDTpT/when-ro...

Listen to this story

Hear this and more stories in a personalized audio briefing.

Open The Chonkerton