Building and evaluating model diffing agents

Google DeepMind researchers have developed 'diffing agents'—auditor AI systems designed to hunt for behavioral differences between language models. The innovation: instead of testing models against a fixed set of prompts, these agents intelligently craft their own questions to discover subtle, interesting differences. The team validated their approach on real model pairs and even tested it against models secretly trained with hidden behaviors. According to the research, this matters because standard safety testing can only expose what you're already looking for. By automating the search for unexpected model differences, researchers hope to catch surprises that might slip through traditional audits.

Source: https://www.lesswrong.com/posts/qi4mNbZYAFDYwfRba/buildin...

Listen to this story

Hear this and more stories in a personalized audio briefing.

Open The Chonkerton