LLM-Driven Feature Discovery

In a post on the AI Alignment Forum, researchers describe a new technique called LLM-Driven Feature Discovery, a way to understand what is happening inside AI models without access to their internal mechanisms. Here's how it works: they take transcripts of model conversations, feed them to another LLM, and ask it to identify interesting features such as patterns, behaviors, and quirks. Those features are then embedded, clustered, and labeled to reveal what the model is actually doing. The team tested the method on one hundred thousand chat transcripts from Google's Gemini model. Unlike traditional methods that dig into a model's mathematical internals, this approach works only from the text of the conversations. It resembles a technique called sparse autoencoders, but it is simpler: it doesn't require repeated iterations or a specific target to optimize toward. The method isn't perfect; the researchers couldn't reliably predict when a model would respond based on user input alone. But it offers an intriguing alternative for understanding AI behavior when you can't look inside the black box.
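The extract, embed, cluster, and label steps described above can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions, not the researchers' implementation: extract_features is a hypothetical stand-in for the LLM call (here it simply treats each sentence as a feature), the embedding is a bag-of-words count rather than a real embedding model, the clustering is a simple greedy pass with an arbitrary similarity threshold, and the label is just the most frequent words, where a real pipeline would presumably ask an LLM.

```python
import math
from collections import Counter


def extract_features(transcript):
    """Hypothetical stand-in for the LLM step: each sentence becomes one 'feature'."""
    return [s.strip().lower() for s in transcript.split(".") if s.strip()]


def embed(text):
    """Toy bag-of-words embedding (assumption: a real pipeline would use an embedding model)."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(features, threshold=0.6):
    """Greedy clustering: a feature joins the first cluster whose seed is similar enough."""
    clusters = []  # each entry is (seed_vector, members)
    for feature in features:
        vector = embed(feature)
        for seed, members in clusters:
            if cosine(seed, vector) >= threshold:
                members.append(feature)
                break
        else:
            clusters.append((vector, [feature]))
    return [members for _, members in clusters]


def label(members):
    """Toy label: the two most frequent words in the cluster."""
    counts = Counter(w for m in members for w in m.split())
    return " ".join(w for w, _ in counts.most_common(2))


def discover(transcripts):
    """Run the whole sketch: extract features, cluster them, label each cluster."""
    features = [f for t in transcripts for f in extract_features(t)]
    return [(label(members), members) for members in cluster(features)]


if __name__ == "__main__":
    demo = [
        "Model apologizes for the error. Model refuses the request",
        "Model apologizes for the mistake. Model refuses the task",
    ]
    for name, members in discover(demo):
        print(name, "->", members)
```

On the two made-up transcripts in the demo, the sketch groups the apology sentences into one cluster and the refusal sentences into another. The point is only to show the shape of the pipeline; every component here would be swapped for an LLM or an embedding model in practice.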

Source: https://www.alignmentforum.org/posts/WAZWA6FPQvH8okouJ/ll...
