Most Current Model Organisms Leak: Perplexity Differencing Often Reveals Finetuning Objectives

Researchers at Z3 Research have uncovered a vulnerability in how we fine-tune large language models. When models are deliberately modified to exhibit specific behaviors—whether that's responding in a certain way, harboring false facts, or even hidden misaligned objectives—they tend to leak. The team developed a technique called perplexity differencing to detect these leaks: by comparing how a modified model completes random prompts against an unmodified version, the differences in how likely each model finds a particular response expose what's been installed. Testing on seventy-six models, they achieved state-of-the-art results in detecting hidden behaviors. This matters enormously as AI systems become more integrated into society—auditing and verifying model behavior is now becoming possible in ways it wasn't before.

Source: https://www.lesswrong.com/posts/uwqtfxvhYRcyLazeP/most-cu...

Listen to this story

Hear this and more stories in a personalized audio briefing.

Open The Chonkerton