Most Current Model Organisms Leak: Perplexity Differencing Often Reveals Finetuning Objectives

According to a new paper on LessWrong, researchers have discovered that large language models often inadvertently reveal the behaviors they've been specifically trained to exhibit. The team developed a technique called perplexity differencing: by comparing how likely certain text completions are under a finetuned model versus an unmodified reference model, they can surface suspicious patterns the model was designed to hide. In tests on ninety model organisms, the method achieved state-of-the-art results for detecting hidden behaviors, including backdoors and unsafe capabilities. The research suggests that models deliberately trained to be deceptive are far leakier than their creators might hope.
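The comparison described above can be illustrated with a minimal sketch. The summary does not give the researchers' exact scoring rule, so this one assumes the score is the difference in log-perplexity between the reference and finetuned models. The function names and per-token log-probabilities are hypothetical, made up for illustration, and do not come from the research.

```python
import math
from typing import Dict, List, Tuple


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity: exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_perplexity_gap(
    finetuned: Dict[str, List[float]],
    reference: Dict[str, List[float]],
) -> List[Tuple[str, float]]:
    """Rank texts by log(reference ppl) - log(finetuned ppl), largest first.

    A large positive gap means the finetuned model finds the text far more
    likely than the reference model does (assumed scoring rule).
    """
    gaps = []
    for text, ft_logprobs in finetuned.items():
        gap = math.log(perplexity(reference[text])) - math.log(perplexity(ft_logprobs))
        gaps.append((text, gap))
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-token natural-log probabilities, invented for this example.
finetuned_scores = {
    "ordinary reply": [-2.0, -2.0, -2.0],
    "suspicious reply": [-0.5, -0.5, -0.5],
}
reference_scores = {
    "ordinary reply": [-2.0, -2.1, -1.9],
    "suspicious reply": [-3.0, -3.0, -3.0],
}

ranked = rank_by_perplexity_gap(finetuned_scores, reference_scores)
for text, gap in ranked:
    print(f"{text}: {gap:.2f}")
```

In this toy data the "suspicious reply" ranks first because only the finetuned model assigns it high probability. In practice the log-probabilities would come from scoring the same text with both models.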

Source: https://www.lesswrong.com/posts/uwqtfxvhYRcyLazeP/most-cu...
