LLM-Driven Feature Discovery

According to a post on LessWrong, researchers have developed a method for understanding what language models actually do: use other language models to analyze their behavior. The process starts with transcripts, which are examined for interesting patterns (called features) that are then clustered and named. The technique was tested on one hundred thousand conversations from Google's Gemini model. It is described as simpler than existing interpretability methods such as sparse autoencoders, and it requires no access to the model's internal workings. Though the researchers are moving on to other work, they are inviting the community to build on it.
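
As a rough illustration of the transcript-to-features-to-clusters-to-names flow described above, here is a minimal Python sketch. It is not the researchers' implementation. The functions `toy_extract` and `toy_name` are hypothetical stand-ins for the language-model calls, the `TOY_FEATURES` data is invented, and the word-overlap (Jaccard) grouping is only a placeholder, since the summary does not say how features are clustered.

```python
from typing import Callable

# Canned stand-in for LLM output: transcript -> free-text feature descriptions.
# Hypothetical data, invented purely for this sketch.
TOY_FEATURES = {
    "User: fix my code. Assistant: Sorry, I can't help with that.": [
        "model refuses the request",
        "model apologizes to the user",
    ],
    "User: tell me a secret. Assistant: I can't share that.": [
        "model refuses the user request",
    ],
    "User: you were wrong. Assistant: Sorry about that mistake.": [
        "model apologizes to the user politely",
    ],
}


def toy_extract(transcript: str) -> list[str]:
    """Placeholder for an LLM call that lists notable patterns in a transcript."""
    return TOY_FEATURES.get(transcript, [])


def toy_name(cluster: list[str]) -> str:
    """Placeholder for an LLM call that names a cluster; picks the shortest description."""
    return min(cluster, key=len)


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two descriptions, from 0.0 to 1.0."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def cluster_features(features: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy grouping: a feature joins the first cluster whose first member is similar enough."""
    clusters: list[list[str]] = []
    for feature in features:
        for cluster in clusters:
            if jaccard(feature, cluster[0]) >= threshold:
                cluster.append(feature)
                break
        else:
            clusters.append([feature])
    return clusters


def discover(
    transcripts: list[str],
    extract: Callable[[str], list[str]],
    name: Callable[[list[str]], str],
    threshold: float = 0.5,
) -> list[tuple[str, list[str]]]:
    """Extract features from every transcript, cluster them, and label each cluster."""
    features = [f for t in transcripts for f in extract(t)]
    return [(name(c), c) for c in cluster_features(features, threshold)]


result = discover(list(TOY_FEATURES), toy_extract, toy_name)
for label, members in result:
    print(f"{label}: {len(members)} feature(s)")
```

Passing the extraction and naming steps in as plain callables means the toy stand-ins could be swapped for real model calls without touching the clustering logic. Nothing in the sketch reads the analyzed model's internals, only its transcripts.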

Source: https://www.lesswrong.com/posts/WAZWA6FPQvH8okouJ/llm-dri...
