A Theory of Prompt Injection (and why you should study roles)

According to LessWrong, researchers have developed a theory of how prompt injection attacks exploit the way language models process information. An LLM treats everything as one long text stream—your instructions, its own thoughts, and fetched webpages all run together. Only small tags like user, system, or tool tell the model what's trustworthy and what isn't. When attackers hide commands inside web data, the model sometimes confuses a low-privilege webpage for a high-privilege instruction. Charles Ye and collaborators, supported by CBAI and Cosmos, show how to predict when this role confusion happens, create new attacks, and explain puzzling results from AI mechanistic research. Their full findings appear in the International Conference on Machine Learning.

Source: https://www.lesswrong.com/posts/d8xDGzCEYE639qqEv/a-theor...

Listen to this story

Hear this and more stories in a personalized audio briefing.

Open The Chonkerton