A Mechanistic Explanation of Prompt Injection (and why you should study roles)

According to researchers publishing on LessWrong today, prompt injection attacks work because of a fundamental mismatch in how language models perceive text. LLMs receive all input—user commands, system instructions, external data, and their own reasoning—as one continuous stream, distinguished only by role tags: user, tool, and system. In a prompt injection attack, an attacker embeds commands inside data marked as external data—a tool tag—but the model treats it as a user instruction instead. The research, supported by CBAI and Cosmos, presents the first mechanistic explanation of why this happens at the token level, and uses that theory to create new attacks and predict which ones will work. The authors are advocating for a new research subfield focused on understanding how models process these role boundaries. The full ICML paper and extended analysis are available online, published today.

Source: https://www.lesswrong.com/posts/d8xDGzCEYE639qqEv/a-mecha...

Listen to this story

Hear this and more stories in a personalized audio briefing.

Open The Chonkerton