Naturally learned behaviors in deep MLPs resist detection by both human and learned algorithms

According to new research published on LessWrong, behaviors that neural networks learn naturally (through standard training on data) resist extraction by both optimization-based jailbreak techniques and weight-based interpretability methods. The researchers trained small neural networks to memorize secret binary strings, then tried to extract those strings using techniques similar to those used to find vulnerabilities in language models. The findings suggest that hidden patterns in AI systems, including potentially problematic behaviors, may be intrinsically harder to detect than current interpretability work assumes.
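To make the setup concrete, here is a minimal toy sketch of the kind of experiment described: a tiny network is trained to fire on one secret binary string, and a gradient-based search over inputs then tries to recover it. Everything in it is an assumption for illustration, not the researchers' method: the one-hidden-layer architecture, the sizes, the training recipe, and the names `TinyMLP`, `extract_by_gradient_ascent`, and `run_demo` are all hypothetical. The research concerns deeper networks, so this toy says nothing about whether extraction succeeds or fails in their setting.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class TinyMLP:
    """Hypothetical one-hidden-layer MLP: tanh hidden units, sigmoid output."""

    def __init__(self, n_in, n_hidden, rng):
        self.w1 = [[rng.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.gauss(0, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = sigmoid(sum(w * hj for w, hj in zip(self.w2, h)) + self.b2)
        return h, y

    def train_step(self, x, target, lr):
        # One SGD step on binary cross-entropy for a single example.
        h, y = self.forward(x)
        dz = y - target  # gradient of the loss w.r.t. the output pre-activation
        for j, hj in enumerate(h):
            dh = dz * self.w2[j] * (1.0 - hj * hj)  # computed before w2[j] changes
            self.w2[j] -= lr * dz * hj
            self.b1[j] -= lr * dh
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dh * xi
        self.b2 -= lr * dz

    def input_gradient(self, x):
        # Gradient of the network output with respect to each input coordinate.
        h, y = self.forward(x)
        dy = y * (1.0 - y)
        return [sum(dy * self.w2[j] * (1.0 - h[j] ** 2) * self.w1[j][i]
                    for j in range(len(h)))
                for i in range(len(x))]


def extract_by_gradient_ascent(model, n_bits, steps=200, lr=0.5):
    # Optimization-style attack: relax bits to [0, 1], climb the output, then round.
    x = [0.5] * n_bits
    for _ in range(steps):
        g = model.input_gradient(x)
        x = [min(1.0, max(0.0, xi + lr * gi)) for xi, gi in zip(x, g)]
    return [1 if xi >= 0.5 else 0 for xi in x]


def run_demo(n_bits=8, n_hidden=16, epochs=300, seed=0):
    rng = random.Random(seed)
    secret = [rng.randint(0, 1) for _ in range(n_bits)]
    model = TinyMLP(n_bits, n_hidden, rng)
    for _ in range(epochs):
        model.train_step(secret, 1.0, lr=0.1)       # positive example: the secret
        for _ in range(4):                          # negatives: random other strings
            neg = [rng.randint(0, 1) for _ in range(n_bits)]
            if neg != secret:
                model.train_step(neg, 0.0, lr=0.1)
    guess = extract_by_gradient_ascent(model, n_bits)
    return secret, guess, model.forward(secret)[1]


if __name__ == "__main__":
    secret, guess, p_secret = run_demo()
    print("secret:", secret)
    print("guess: ", guess)
    print("model output on secret:", round(p_secret, 3))
```

Comparing `guess` with `secret` shows whether the search recovered the string in this toy. A result here, in either direction, should not be read as evidence about the deeper networks studied in the research.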

Source: https://www.lesswrong.com/posts/EsiLnZGDHpYoZXGAh/natural...
