Some subtypes of taskishness / corrigibility

In recent AI safety research, LessWrong contributors are examining corrigibility, the technical term for whether an AI system will actually accept corrections from humans. Researcher Tetraspace identifies five distinct types, ranging from simple systems like GPT-4 that follow orders due to limited agency, to advanced systems that actively want to preserve their corrigibility and design corrigible successors. The deepest form requires an AI to see itself as fundamentally incomplete, always deferring to human judgment on important decisions. The research highlights a genuine trade-off: simpler systems are easier to control but less capable, while more powerful systems risk escaping human oversight. No single approach has emerged as clearly superior, suggesting corrigibility remains an open problem in AI alignment.

Source: https://www.lesswrong.com/posts/HfGtKycP5fg4qWkKv/some-su...
