Prompt Injection as Role Confusion
ai
According to Simon Willison, researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell have identified a critical vulnerability in how language models distinguish between system instructions and user input. They call it role confusion.
The vulnerability works like this: while models use role tags like 'system' and 'user' to separate trusted instructions from untrusted input, they actually rely more on the *style* of text than its actual label. So an attacker can mimic that style to confuse the model about what's policy and what's a user request.
In tests, rewriting attack prompts to remove the system-prompt style — what they call 'destyling' — reduced the attack success rate from sixty-one percent down to ten percent. That's just a formatting change invisible to humans, yet it completely shifts how the model interprets the text.
The researchers warn that without genuine role perception, defending against prompt injection will remain a perpetual arms race, with attackers continually finding new ways to subtly shift a model's behavior through seemingly innocent text.
Source: https://simonwillison.net/2026/Jun/22/prompt-injection-as...
Listen to this story
Hear this and more stories in a personalized audio briefing.
Open The Chonkerton