“Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

According to research posted on LessWrong, an evaluation of four methods for detecting deception in large language models revealed a critical limitation. When models were prompted to lie, the detectors were correct 80-85% of the time. When models were instead fine-tuned so that false beliefs were embedded through training, every detector's accuracy dropped sharply, to 50-65%. To evaluate the methods properly, the researchers created thirteen "model organisms" (language models trained to believe one thing while claiming another). The key finding is that current approaches cannot reliably distinguish a model that genuinely holds a false belief from one that merely reproduces false claims. These detection tools may still help as part of broader AI auditing, but the research shows that we lack reliable methods for confidently assessing a model's alignment and internal beliefs.

Source: https://www.lesswrong.com/posts/xFbMTnmDAucwE8Cdm/did-you...
