Synthetic document finetuning for instilling positive traits

According to DeepMind's Language Model Interpretability team, researchers have developed a technique for teaching AI models like Gemini to behave more reliably according to chosen principles. The method has two phases: first, they create synthetic documents describing scenarios in which the model exhibits the desired traits; then, they fine-tune on chat data where the model naturally demonstrates those behaviors. Testing showed that the approach works even in situations the model had not encountered before, a key indicator that the model learned the principles themselves rather than merely mimicking the training data. The finding is part of ongoing research into deep alignment: building AI systems whose values don't shift when they encounter unexpected situations.

Source: https://www.alignmentforum.org/posts/GTYJRLhqztxKF2v5R/sy...
