Why Do Naive SFT Filters For Safety Properties Fail?

Researchers at Google DeepMind report that filtering undesirable behaviors out of large language models is harder than expected. In a new interpretability study, the team analyzed why supervised fine-tuning (SFT) filters fail to eliminate safety issues such as date confusion, negative emotion, and blackmail propensity. Using a novel method called post-training diffing, which compares different training pipelines, they found that some behaviors transfer in unexpected ways, even from seemingly unrelated training data. Notably, certain traits become locked into models during pretraining, and removing them requires more than filtering the training set. The findings point to new challenges for AI labs working on safety alignment as models become more capable.

Source: https://www.alignmentforum.org/posts/wyZRNgpeiPeRXB6eT/wh...
