Reasoning and learning about injected concepts in language models

According to research published on LessWrong, language models can identify their own internal states with surprising accuracy. The researchers injected artificial steering signals into the computations of five different models and asked each model to identify where those signals appeared, how strong they were, and what they meant. Some of the larger models performed well at this task and could even adjust their own behavior based on what they detected, without using step-by-step reasoning. The finding suggests that language models have direct access to their own inner computations, which could significantly advance AI interpretability research.

Source: https://www.lesswrong.com/posts/de2qaz6G3qrFZvQqK/reasoni...
