Refusal Is Complicated As Hell: An Update

Researchers at ValueShift are investigating how language models decide to refuse harmful requests. In experiments using sparse autoencoders, they found that refusal may not be a standalone safety mechanism; it could instead be deeply intertwined with other model behaviors such as helpfulness and harmlessness. The team identified at least two separable refusal patterns, each associated with a different type of potential harm. An open question remains: can refusal be isolated as its own component, or is it inseparable from the training that produced it? The findings matter for AI safety researchers trying to understand and strengthen how models resist misuse. The team is also calling for a broader taxonomy of harm categories to clarify exactly what models are learning to refuse.

Source: https://www.lesswrong.com/posts/tshEdBArLbjhBxwDq/refusal...
