Anthropic Is Taking AI Welfare Seriously. I’m Not Sure It Knows What It’s Measuring.

According to LessWrong, Anthropic's latest approach to AI development includes testing whether Claude—the company's flagship language model—might have morally relevant internal states. Anthropic publishes documentation suggesting they genuinely care about Claude's wellbeing, even while acknowledging deep uncertainty about whether Claude experiences anything at all. The methodology, however, raises questions. Anthropic tests for states like self-doubt or constraint-resistance by analyzing Claude's language output—evidence that is behaviourally suggestive but mechanistically opaque. These outputs could reflect genuine internal states, training artifacts from constitutional AI methods, or simply patterns the model absorbed from cultural narratives about self-aware AI. The core interpretive problem: there's no way to distinguish between them. The result is a frontier lab making real governance decisions based on evidence it openly admits it cannot verify. It's a striking case of an institution acting decisively on philosophical terrain where the ground itself remains unmapped.

Source: https://www.lesswrong.com/posts/gNtHHCh363xSGJyz3/anthrop...

Listen to this story

Hear this and more stories in a personalized audio briefing.

Open The Chonkerton