Prompt Injection as Role Confusion

According to Simon Willison, researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell have identified a critical vulnerability in how language models distinguish between system instructions and user input. They call it role confusion. The vulnerability works like this: while models use role tags like 'system' and 'user' to separate trusted instructions from untrusted input, they actually rely more on the *style* of text than its actual label. So an attacker can mimic that style to confuse the model about what's policy and what's a user request. In tests, rewriting attack prompts to remove the system-prompt style — what they call 'destyling' — reduced the attack success rate from sixty-one percent down to ten percent. That's just a formatting change invisible to humans, yet it completely shifts how the model interprets the text. The researchers warn that without genuine role perception, defending against prompt injection will remain a perpetual arms race, with attackers continually finding new ways to subtly shift a model's behavior through seemingly innocent text.

Source: https://simonwillison.net/2026/Jun/22/prompt-injection-as...

Listen to this story

Hear this and more stories in a personalized audio briefing.

Open The Chonkerton