One axis and two features: how I solved the first puzzle from BlueDot and how a classifier hid country on the food direction

According to Igor Peverezev on LessWrong, standard interpretability methods miss how one feature of a small text classifier is encoded. The puzzle involves a classifier trained to detect eight simple properties of text, such as whether it mentions food or contains a number. Seven of the features are encoded linearly in the model's hidden layers, and linear probes recover them easily. The eighth, country, breaks that pattern: a linear probe performs barely better than random guessing, yet a nonlinear probe recovers the feature almost perfectly. The reason is that the country feature is stored by distance rather than by direction. The sign of the projection doesn't matter, only how far the activation lies along the axis, which the title identifies as the food direction. Peverezev built the case methodically, ruling out simpler explanations first: boolean combinations of the other features, then progressively more complex geometries. The result exposes a real gap in how neural networks are interpreted: linear probes assume features are written as directions, and that common assumption can miss stranger, more compact encodings hiding in plain sight.
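The difference between a direction encoding and a distance encoding can be illustrated with a toy sketch. This is not Peverezev's code or the puzzle's model: the synthetic activations, the dimensions, the magnitudes, and the helper names `fit_probe_scores` and `auc` are all assumptions made for illustration. A binary label is written only into how far each activation lies along one axis, with a random sign, and a linear readout on the raw activations is compared with a readout on the absolute projection (which assumes the axis is already known).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4000, 16

# Hypothetical setup: one unit axis in a 16-dimensional activation space.
axis = rng.normal(size=d)
axis /= np.linalg.norm(axis)

label = rng.integers(0, 2, size=n)          # toy binary feature to recover
sign = rng.choice([-1.0, 1.0], size=n)      # sign is independent of the label
magnitude = np.where(label == 1, 3.0, 0.5)  # the label lives only in the distance
acts = (sign * magnitude)[:, None] * axis + 0.3 * rng.normal(size=(n, d))

def fit_probe_scores(features, y, train_frac=0.5):
    """Fit a least-squares linear readout (with bias) on the first part of
    the data and return its scores and labels on the held-out remainder."""
    X = np.hstack([features, np.ones((len(features), 1))])
    split = int(len(X) * train_frac)
    w, *_ = np.linalg.lstsq(X[:split], 2.0 * y[:split] - 1.0, rcond=None)
    return X[split:] @ w, y[split:]

def auc(scores, y):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Linear probe on raw activations: the two signs cancel, so the score
# ranks positive and negative examples about equally often.
linear_auc = auc(*fit_probe_scores(acts, label))

# Distance probe: the same linear readout, applied to |projection on the axis|.
proj = acts @ axis
distance_auc = auc(*fit_probe_scores(np.abs(proj)[:, None], label))

print(f"linear probe AUC:   {linear_auc:.3f}")
print(f"distance probe AUC: {distance_auc:.3f}")
```

AUC is used here instead of accuracy because a thresholded linear readout can still pick off one of the two far clusters, which would muddy the comparison. In this toy setting the linear score should sit near chance while the distance score separates the classes almost completely.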

Source: https://www.lesswrong.com/posts/ZPNf3QbQ5rLgkKtY7/one-axi...
