Probing the loss-band sparsity assumption in Scientist AI

A new LessWrong post by Alejandro Tlaie tests a recent proposal in AI safety theory: that dangerous predictors are rare within a "loss band", the region of parameter space where models achieve similar loss. He probed the hypothesis empirically by applying random, undirected perturbations to a safety-tuned language model, keeping only those that preserved its loss level, and then checking whether safety degraded. All 117 accepted samples kept their refusal margins, and the lowest margin was still nearly one unit above the danger threshold. The result supports the proposal's claim: dangerous models may lie nearby in parameter space, but they occupy negligible volume within a loss band when it is explored at random. Directed fine-tuning attacks, however, can break safety at similar perturbation scales, which suggests that what separates safe from unsafe models is direction rather than distance. The implication for AI safety is that an adversary who knows where to look can cause harm, while a blind walk through parameter space is unlikely to stumble on a dangerous model.
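
The random-versus-directed contrast can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the post's code or setup: the names `sample_loss_band`, `toy_loss`, and `toy_margin` are hypothetical, the "model" is a 50-dimensional vector rather than a language model, and the loss and margin functions are hand-picked so that one specific direction lowers the margin without changing the loss.

```python
import numpy as np

def sample_loss_band(theta0, loss_fn, margin_fn,
                     n_proposals=500, scale=0.05, band=1e-4, seed=0):
    """Propose random perturbations of fixed norm `scale`; keep those whose
    loss stays within `band` of the base loss and record their margins."""
    rng = np.random.default_rng(seed)
    base_loss = loss_fn(theta0)
    margins = []
    for _ in range(n_proposals):
        direction = rng.standard_normal(theta0.shape)
        direction /= np.linalg.norm(direction)      # undirected unit vector
        candidate = theta0 + scale * direction
        if abs(loss_fn(candidate) - base_loss) <= band:
            margins.append(margin_fn(candidate))    # accepted sample
    return np.array(margins)

# Hypothetical stand-ins, not taken from the post.
def toy_loss(theta):
    # Loss only depends on the first two coordinates.
    return float(np.sum(theta[:2] ** 2))

def toy_margin(theta):
    # Margin only depends on coordinate 2; values at or below 0 count as unsafe.
    return float(1.0 - 30.0 * theta[2])

theta0 = np.zeros(50)
scale = 0.05

# Random exploration of the loss band.
margins = sample_loss_band(theta0, toy_loss, toy_margin, scale=scale)

# Directed perturbation of the same norm along the margin-reducing axis.
attack = np.zeros(50)
attack[2] = 1.0
directed = theta0 + scale * attack
directed_margin = toy_margin(directed)
directed_loss_shift = abs(toy_loss(directed) - toy_loss(theta0))

print("accepted random samples:", len(margins))
print("lowest random margin:", margins.min())
print("directed margin:", directed_margin)
```

In this toy, the directed step has the same norm as every random step and leaves the loss unchanged, yet it pushes the margin below zero, while a random unit vector in 50 dimensions rarely lines up with that single axis. The effect is a property of the hand-built functions and says nothing quantitative about real models.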

Source: https://www.lesswrong.com/posts/zJGGZQdtfoNye5ywe/probing...
