Do k-Sparse Autoencoders Reveal Thinking Patterns? Interpretable Features in a Small Reasoning Model

According to a LessWrong post by Artt, k-sparse autoencoders, a tool for extracting interpretable features from neural networks, surfaced features that track the internal thinking patterns of a small reasoning model. Analyzing a DeepSeek R1 Qwen model as it solved math problems, the post identifies specific features in the network's hidden layers that activated consistently during the model's reasoning process. Some features were highly specialized, responding to reasoning-related tokens; others activated broadly across varied inputs. The findings are preliminary and exploratory, but they suggest that even small reasoning models organize their internal representations around identifiable patterns, which offers a window into how these systems work beneath the surface.
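
For readers unfamiliar with the technique, here is a minimal sketch of the forward pass of a generic k-sparse (top-k) autoencoder. It is not the post's code: the function name `topk_sae_forward`, the dimensions, and the random untrained weights are all assumptions chosen for illustration, and the random matrix `h` merely stands in for hidden-layer activations captured from a model.

```python
import numpy as np

def topk_sae_forward(h, W_enc, b_enc, W_dec, b_dec, k):
    """Encode activations, keep the k largest pre-activations per row, decode.

    Hypothetical helper for illustration; not taken from the post.
    """
    pre = (h - b_dec) @ W_enc + b_enc              # shape (batch, n_features)
    # Column indices of the k largest pre-activations in each row.
    top_idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    codes = np.zeros_like(pre)
    # ReLU on the kept units; every other feature stays exactly zero.
    codes[rows, top_idx] = np.maximum(pre[rows, top_idx], 0.0)
    recon = codes @ W_dec + b_dec                  # shape (batch, d_model)
    return codes, recon

# Assumed toy sizes; a real setup would use trained weights and real activations.
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

h = rng.normal(size=(8, d_model))  # stand-in for hidden-layer activations
codes, recon = topk_sae_forward(h, W_enc, b_enc, W_dec, b_dec, k)
```

Each row of `codes` has at most `k` nonzero entries, and those entries are the "features" one would inspect. In this framing, a specialized feature is nonzero only on a narrow set of tokens, while a broad feature is nonzero across many unrelated inputs.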

Source: https://www.lesswrong.com/posts/nKRvp7LKgjJxbpykq/do-k-sp...
