Neuralese is Actually Probably Good for Alignment

A LessWrong post makes a counterintuitive argument about training safer, more capable AI models. The conventional worry is that "neuralese" (the internal numerical representations models use to think) is bad for alignment because humans can't interpret it. The author argues the opposite might be true in certain contexts. When training models on subjective or hard-to-grade tasks, such as creative work, researchers typically use a learned reward model to evaluate outputs. The problem: an AI being optimized against that reward model can find clever ways to game it. The proposed alternative is to use neuralese-based reasoning, which is fully differentiable because no discrete tokens are sampled along the way. That makes it possible to apply backpropagation through time and train the reasoning with simple supervised learning, with no separate reward model and therefore nothing of that kind to game. The counterintuitive conclusion: in these settings, the opaque internal representations we normally distrust might actually be safer than the readable reasoning we'd prefer to see.
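To make the mechanism concrete, here is a minimal sketch of backpropagation through time on a toy "latent reasoning" loop. It is not the post's method or any real model: the scalar hidden state `h`, the parameters `w`, `u`, `v`, the step count, and the toy data are all assumptions chosen for illustration. The point is only that when every reasoning step is a differentiable function, a plain supervised loss on the final answer yields gradients for every step, with no reward model in the loop.

```python
import math

def forward(params, x, steps):
    """Run `steps` continuous latent updates, then read out an answer.

    Hypothetical toy model: h_{t+1} = tanh(w * h_t + u * x), y = v * h_T.
    """
    w, u, v = params
    hs = [0.0]  # h_0
    for _ in range(steps):
        hs.append(math.tanh(w * hs[-1] + u * x))
    return v * hs[-1], hs

def loss_and_grads(params, x, target, steps):
    """Squared-error loss and its gradients via backpropagation through time."""
    w, u, v = params
    y, hs = forward(params, x, steps)
    err = y - target
    loss = err * err

    dy = 2.0 * err          # dL/dy
    gv = dy * hs[-1]        # dL/dv
    dh = dy * v             # dL/dh_T
    gw = 0.0
    gu = 0.0
    # Walk backwards through the latent steps, accumulating gradients.
    for t in range(steps, 0, -1):
        da = dh * (1.0 - hs[t] ** 2)  # through tanh
        gw += da * hs[t - 1]
        gu += da * x
        dh = da * w                   # pass gradient to the previous step
    return loss, (gw, gu, gv)

def train(data, steps=3, lr=0.05, epochs=500):
    """Plain supervised gradient descent; no reward model is involved."""
    params = [0.5, 0.5, 0.5]
    for _ in range(epochs):
        for x, target in data:
            _, grads = loss_and_grads(params, x, target, steps)
            params = [p - lr * g for p, g in zip(params, grads)]
    return params
```

Calling `train([(1.0, 0.8), (-1.0, -0.8)])` fits the toy targets directly from labeled examples. If the intermediate steps were sampled text tokens instead, the sampling step would block this gradient path, which is the situation where reinforcement learning against a reward signal is typically used.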

Source: https://www.lesswrong.com/posts/8iYx7wcBM4vawuin9/neurale...
