Toy transformers may represent belief-state geometry optimally but not minimally

A new study in neural network interpretability suggests that transformers hoard information. Researchers trained toy transformers on tasks in which certain information (the state of a tracked random coin) becomes predictively useless yet remains present in the network's residual stream. The team showed that this coin information was inert: directly removing it did not degrade prediction accuracy. The network kept it anyway. When the researchers imposed capacity constraints, forcing the network to be selective, it did prune data, but not in the way a theory of optimal prediction would suggest. Instead of discarding information according to its predictive irrelevance, the transformer preferentially forgot the oldest defunct data first, which suggests it relied on temporal ordering rather than information value to decide what to keep. The finding, published on LessWrong, indicates that transformers can converge on optimal predictions without learning a minimal representation. They trade perfect efficiency for simplicity, much like a filing cabinet that holds old papers long past their usefulness.
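To make "inert" concrete, here is a toy sketch of the removal test in plain Python. It is not the study's code or setup. It assumes, purely for illustration, that the coin is stored along a single linear direction of a 3-dimensional activation vector and that the prediction readout is orthogonal to that direction. All names (`COIN_DIRECTION`, `READOUT_WEIGHTS`, `project_out`, `make_activation`) are hypothetical.

```python
import random


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def project_out(x, direction):
    """Remove the component of x along a unit-length direction."""
    scale = dot(x, direction)
    return [xi - scale * di for xi, di in zip(x, direction)]


# Assumption: axis 2 of this made-up "residual stream" stores the coin,
# while axes 0 and 1 carry the features the readout actually uses.
COIN_DIRECTION = [0.0, 0.0, 1.0]
READOUT_WEIGHTS = [0.7, -0.3, 0.0]  # orthogonal to COIN_DIRECTION


def make_activation(rng):
    """Return a synthetic activation vector and the coin value hidden in it."""
    coin = rng.choice([-1.0, 1.0])
    useful = [rng.gauss(0, 1), rng.gauss(0, 1)]
    return useful + [coin], coin


rng = random.Random(0)
samples = [make_activation(rng) for _ in range(1000)]

max_prediction_change = 0.0   # how much the readout moves after ablation
coin_recoverable_before = True  # can a linear probe read the coin beforehand?
max_coin_after = 0.0          # what the same probe sees after ablation

for x, coin in samples:
    ablated = project_out(x, COIN_DIRECTION)
    change = abs(dot(READOUT_WEIGHTS, x) - dot(READOUT_WEIGHTS, ablated))
    max_prediction_change = max(max_prediction_change, change)
    coin_recoverable_before &= (dot(x, COIN_DIRECTION) == coin)
    max_coin_after = max(max_coin_after, abs(dot(ablated, COIN_DIRECTION)))

print("largest change in prediction:", max_prediction_change)
print("coin readable before ablation:", coin_recoverable_before)
print("largest coin signal after ablation:", max_coin_after)
```

In this constructed case the coin is fully readable before ablation and gone afterward, while the readout does not move. That is the sense in which stored information can be present but unused. A trained network would need the direction to be estimated rather than assumed.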

Source: https://www.lesswrong.com/posts/vzav5kfbRCDQyEB8v/toy-tra...
