The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn't

According to LessWrong researcher Alek Westover, there's a logical puzzle in AI safety called the distillation double bind. When you distill knowledge from a potentially misaligned AI into a smaller model, one of two things happens: the misalignment transfers to the student, potentially revealing itself during audits, or it doesn't transfer, leaving you with a capable but benign model. Westover proposes techniques to exploit both outcomes—either gathering evidence of the original model's misalignment, or safely extracting its capabilities. If one approach fails, it signals the other is more likely to succeed.

Source: https://www.lesswrong.com/posts/gsoLoKY4Spzc3PEb6/the-dis...

Listen to this story

Hear this and more stories in a personalized audio briefing.

Open The Chonkerton