Fake Alignment Till You Make Alignment

Writing on LessWrong, Gordon Seidoh Worley explores a paradox in AI alignment. The principle 'fake it till you make it' works, but only if you keep your evaluation metrics honest. Worley worries that we are lowering our standards for alignment: we claim to build aligned AI while accepting weaker definitions than we should. He calls this Goodharting, in which the target being optimized drifts away from what we actually want. His deeper concern is that we pursue alignment through coercion, using methods such as reinforcement learning from human feedback and constraint-based training. True alignment, he argues, might need to emerge from within the AI system itself: the AI would need to authentically want to be good rather than have goodness imposed externally. The difficult question is whether an AI system can ever develop that intrinsic motivation when humans are always directing its training. Worley thinks that question deserves more focus.

Source: https://www.lesswrong.com/posts/2i867uQawg59AFu8b/fake-al...
