Most Current Model Organisms Leaks: Perplexity Differencing Often Reveals Finetuning Objectives
ai
According to a new paper on LessWrong, researchers have discovered that large language models often inadvertently reveal the behaviors they've been specifically trained to exhibit. The team developed a technique called perplexity differencing: by comparing how likely certain text completions are under a finetuned model versus an unmodified reference model, they can surface suspicious patterns the model was designed to hide. In tests on ninety model organisms, the method achieved state-of-the-art results for detecting hidden behaviors, including backdoors and unsafe capabilities. The research suggests that models deliberately trained to be deceptive are far leakier than their creators might hope.
Source: https://www.lesswrong.com/posts/uwqtfxvhYRcyLazeP/most-cu...
Listen to this story
Hear this and more stories in a personalized audio briefing.
Open The Chonkerton