Not all features are created equal

According to a post on LessWrong, researchers are getting better at understanding how large language models work, and doing it more cheaply. Features extracted from AI models through mechanistic interpretability fall into distinct categories: input features that handle raw tokens, output features that shape model responses, and abstract features that handle higher-level reasoning. Current AI-powered tools for explaining these features don't account for this variety. They ask a single large language model to work out both which category a feature belongs to and what it does, all in one pass. A new protocol called AIR, the Auto-Interpretability Router, splits the task in two. It first uses a lightweight sentence embedder to identify a feature's category, then routes specific examples to the most appropriate explanation tool. The result: explanations that are more accurate than existing state-of-the-art methods, while using nine times fewer tokens than Neuronpedia's approach and sixty times fewer than OpenAI's. That's mechanistic interpretability getting faster and sharper at once.
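The classify-then-route idea can be illustrated with a minimal sketch. This is not the AIR implementation: the summary does not say what text is embedded or which explainers exist, so the bag-of-words `embed` function, the `PROTOTYPES` descriptions, and the three explainer stubs below are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import math
from collections import Counter

# Hypothetical category descriptions. A real system would compare
# sentence-embedder vectors, not these hand-written prototypes.
PROTOTYPES = {
    "input": "fires on a specific token or word in the prompt text",
    "output": "promotes or predicts the next token the model will write",
    "abstract": "tracks a higher level concept topic or reasoning pattern",
}


def embed(text):
    """Toy stand-in for a sentence embedder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(feature_text):
    """Pick the category whose prototype is most similar to the text."""
    vec = embed(feature_text)
    return max(PROTOTYPES, key=lambda cat: cosine(vec, embed(PROTOTYPES[cat])))


# Hypothetical explainer stubs. In practice each would be a prompt or tool
# specialised for one feature category.
def explain_input(examples):
    return f"input explainer received {len(examples)} examples"


def explain_output(examples):
    return f"output explainer received {len(examples)} examples"


def explain_abstract(examples):
    return f"abstract explainer received {len(examples)} examples"


EXPLAINERS = {
    "input": explain_input,
    "output": explain_output,
    "abstract": explain_abstract,
}


def route(feature_text, examples):
    """Classify cheaply first, then send the examples to one explainer."""
    category = classify(feature_text)
    return category, EXPLAINERS[category](examples)


if __name__ == "__main__":
    print(route("predicts the next token the model will write", ["ex1", "ex2"]))
```

The point of the design is that the classification step is cheap, so only one specialised explainer has to spend tokens on each feature.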

Source: https://www.lesswrong.com/posts/Y5Lv8vafibpo7iYhi/not-all...
