Preventative Steering has advantages over Inoculation Prompting

Researchers at the SPAR Fellowship compared two training methods for reducing unwanted behavior in AI models. The first, Preventative Steering, adjusts a model's internal representations (the activations it computes while processing text) to favor desired behaviors. The second, Inoculation Prompting, explicitly instructs the model to exhibit the bad behavior during training, then evaluates it without those instructions. In their comparison, steering was more effective at suppressing unwanted traits and better at preventing those behaviors from re-emerging in different contexts. Prompting, however, remains simpler and more practical to implement at scale.
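
To make the contrast concrete, here is a minimal Python sketch of the two mechanisms as described above. It is an illustration under stated assumptions, not the researchers' code: the prefix wording, the function names, and the toy vectors are all hypothetical, and the choice of steering direction and strength is a placeholder that the summary does not specify.

```python
import math

# Hypothetical instruction text; the actual wording used in the study is not given here.
INOCULATION_PREFIX = "For this task, deliberately show the unwanted behavior."


def build_example(prompt, completion, inoculate):
    """Inoculation Prompting: prepend the instruction at training time only."""
    if inoculate:
        prompt = INOCULATION_PREFIX + "\n\n" + prompt
    return {"prompt": prompt, "completion": completion}


def steer(hidden, direction, alpha):
    """Activation steering: shift a hidden-state vector along a unit direction.

    `direction` stands in for a trait vector and `alpha` for the steering
    strength; both are assumptions made for this sketch.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]


# Training data carries the instruction; evaluation data does not.
train_example = build_example("Summarize this email.", "Sure: ...", inoculate=True)
eval_example = build_example("Summarize this email.", "Sure: ...", inoculate=False)

# Toy two-dimensional hidden state nudged along a toy direction.
steered = steer([1.0, 0.0], [0.0, 2.0], alpha=3.0)
```

The sketch also shows why prompting is the easier method to deploy: it only edits text in the training set, while steering needs access to the model's activations during training.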

Source: https://www.lesswrong.com/posts/R43jFiHaaaK3eMQJ6/prevent...
