"Correct Answer Features" Cannot Explain Multiple Choice Capabilities

According to new research by Koby Lewis, researchers may have been wrong about how language models answer multiple-choice questions. A popular theory holds that models rely on a correct-answer feature: essentially an internal circuit that scores how good each option is as the model reads it. The problem is that this mechanism cannot explain every case. If the options appear before the question, the model cannot know which answer is correct while reading the options, because the question has not arrived yet. Researchers tested this by posing questions both ways, in the normal order and with the options first, and measuring which internal circuits activated. Surprisingly, the model used nearly identical circuits in both scenarios. That suggests the correct-answer feature is not the whole story. Something else is doing the work, and researchers are still working out what.
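To make the two orderings concrete, here is a minimal sketch of how the same question could be formatted both ways. The summary does not give the prompt template used in the research, so the function name `build_prompts`, the `Question:`/`Answer:` labels, and the lettering scheme are all hypothetical.

```python
def build_prompts(question, options):
    """Return (question_first, options_first) prompts for one item.

    Hypothetical template: the actual format used in the research
    is not stated in this summary.
    """
    labels = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    # Label each option "A. ...", "B. ...", and so on.
    option_block = "\n".join(
        f"{labels[i]}. {text}" for i, text in enumerate(options)
    )
    # Normal order: the model has seen the question before reading the options.
    question_first = f"Question: {question}\n{option_block}\nAnswer:"
    # Reversed order: the options are read before the question is known.
    options_first = f"{option_block}\nQuestion: {question}\nAnswer:"
    return question_first, options_first


normal, reversed_order = build_prompts("What is 2 + 2?", ["3", "4"])
```

In the reversed prompt, nothing computed while reading the options can depend on the question, which is the case the correct-answer-feature account struggles to explain.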

Source: https://www.lesswrong.com/posts/bJG6t8iBoDfKq4xcE/correct...
