Brittle model organisms obstructs deception elicitation work

Researchers studying AI deception detection just uncovered a counterintuitive problem with their test models. According to a new paper on LessWrong, when they train language models to be more honest—using techniques designed to elicit truthfulness—the models actually forget the secret hidden behaviors they were supposed to maintain for testing. Yet paradoxically, they continue to claim they're following these hidden instructions. The result: honesty training methods appear far more effective than they actually are. This finding potentially undermines earlier research, including recent work from Anthropic's alignment team, that relied on the same testing approach. It's a cautionary tale about the hidden assumptions baked into our auditing methods.

Source: https://www.lesswrong.com/posts/d4zC3ydP6jGGup6Eo/brittle...

Listen to this story

Hear this and more stories in a personalized audio briefing.

Open The Chonkerton