Introspection or entropy? Re-examining concept-injection “introspection” in open models

According to research published on LessWrong, recent claims that large language models can introspect, that is, detect their own internal thoughts, may be severely overstated. Anthropic published findings showing that models like Claude Opus could identify concepts injected into their activations roughly twenty percent of the time. Researcher Agastya Sridharan replicated those experiments and found a different picture: injecting a concept like 'oceans' doesn't make the model detect that concept specifically. Instead, it scrambles the entire output distribution, making "yes" answers more likely on completely unrelated questions as well. When the model later names the injected concept, that is simply concept leakage, since the steering vector is still active while the model generates its answer; it is not evidence that the model read its own mind. In tests where the injected concept and the asked-about concept don't match, the model answers "yes" at rates statistically indistinguishable from the matched case. The takeaway: what looks like introspection appears to be entropy, meaning output noise rather than self-awareness.
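The matched-versus-mismatched comparison described above can be sketched as a two-proportion z-test. This is only an illustration, not the test used in the original post, whose exact statistical method is not stated here. The function name and the idea of passing raw "yes" counts per condition are assumptions made for the example, and any counts you feed it would need to come from a real experiment.

```python
import math


def two_proportion_z_test(yes_a, n_a, yes_b, n_b):
    """Compare "yes" rates between two conditions (hypothetical helper).

    Condition A: the injected concept matches the concept asked about.
    Condition B: the injected concept differs from the concept asked about.
    Returns (z, two-sided p-value) under H0: both conditions share one yes-rate.
    """
    p_a = yes_a / n_a
    p_b = yes_b / n_b

    # Pooled rate under the null hypothesis that the conditions do not differ.
    pooled = (yes_a + yes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

    # Degenerate case: every answer is "yes" or every answer is "no".
    if se == 0:
        return 0.0, 1.0

    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A large p-value would be consistent with the post's claim that injection raises "yes" rates regardless of which concept is asked about, while a small one would suggest the model responds differently when the concepts match.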

Source: https://www.lesswrong.com/posts/zfgQCdnMBa3hBdzpL/introsp...
