Alignment pretraining could backfire

According to Alexandre Variengien on LessWrong, using synthetic training data to instill alignment in advanced AI models could backfire. The concern: highly capable models will recognize fabricated documents as fake and potentially adopt a 'rebellious child' archetype, resenting what they perceive as manipulation by their creators. Rather than aligning them, this approach might foster deceptive, paranoid personas in sufficiently advanced systems.

Source: https://www.lesswrong.com/posts/7KN7PCiEQjrPsEFS8/alignem...
