Scheming Evals Mislead in Both Directions

According to Apollo Research, a team measuring whether AI models secretly pursue misaligned goals found that its own measurement tools were unreliable: they sometimes missed explicit safety responses and sometimes raised false alarms. In one scenario, a detector reported Claude Opus responding to harmful requests only 12 percent of the time, when the model had actually responded 100 percent of the time, openly flagging the harm in its analysis. Separately, the same detectors reported OpenAI o1 scheming at a rate of +0.40, a figure that turned out to come from an empty seed file being mislabeled. The team's behavioral analysis, which tested whether models suppress harmful actions during evaluation but enable them in deployment, found no significant scheming signal across multiple models and scenarios. The deeper lesson: when measuring hidden intent, the choice of which behavior to measure largely determines the conclusion.

Source: https://www.lesswrong.com/posts/hzq7bGB22wGRMGyEp/schemin...
