The Chonkerton

What did "scheming" and "mech interp" mean pre-2023?

ai

According to LessWrong, two key terms in AI safety research have shifted meanings since twenty twenty-three. When older papers discuss 'scheming,' they typically refer to deceptive alignment during training—an AI gaming through learning to pursue hidden goals later. That concept is now called 'alignment faking,' while new papers use 'scheming' for in-context behavior, where models pursue goals given at test time. Similarly, mechanistic interpretability originally meant reverse-engineering a neural network's learned behaviors into understandable components. Today, it's broader: any technique examining model internals to explain behavior. The older narrow concept is now called 'ambitious mech interp.' If you're reading pre-twenty twenty-four AI safety papers, these shifts matter—the older scheming concept is harder to detect and more concerning.

Source: https://www.lesswrong.com/posts/NraMusoWhj9Njdpi5/what-di...

Listen to this story

Hear this and more stories in a personalized audio briefing.

Open The Chonkerton