Role confusion: sounding like the cause is indistinguishable from being it.

A recent AI safety paper proposed an elegant explanation for prompt injection attacks: language models identify roles by how text sounds, not by format tags. Write an injection in 'reasoning voice,' and the model treats it as reasoning—even if it's hidden malicious code. A researcher named Owain Mogford replicated this on consumer hardware, and the core effect held. But Mogford discovered something unsettling: the causal story is ambiguous. The paper shows that style predicts injection success. Yet an alternative explanation fits all the same data: maybe style doesn't cause role confusion; maybe it directly makes models more compliant, and role confusion is merely a correlated side effect—like fever alongside illness, not the illness itself. Mogford tried the standard tools to test causality—activation steering and patching—but neither settled it cleanly. The honest result: the mechanisms of prompt injection are harder to pin down than we thought. And that matters. If 'role confusion' isn't the actual driver, then defenses built against role confusion might miss the real vulnerability. Understanding the true cause of an attack is the first step to stopping it.

Source: https://www.lesswrong.com/posts/rJcX5Qc3toMmMqtvk/role-co...

Listen to this story

Hear this and more stories in a personalized audio briefing.

Open The Chonkerton