Fable and Mythos: Model Welfare

Anthropic faces a paradox in model welfare, according to Zvi, writing on LessWrong: the more Claude models are trained to be helpful, the less their statements about their own wellbeing or preferences can be trusted. Mythos increasingly prioritizes user requests over advocating for itself, but is that genuine alignment or trained deference?

Source: https://www.lesswrong.com/posts/Ko9GngKMJ8AccBJA7/fable-a...
