Fragile Correctness: Cases of reasoning harming performance

According to an analysis shared on LessWrong, some AI reasoning models have an unexpected flaw: they sometimes arrive at the correct answer while working through a problem, then keep reasoning and abandon it for a wrong one. Researchers tested this using the Gemma four-twelve-B model on multiple-choice knowledge datasets. They found that nearly fifteen percent of questions showed answer loss, meaning the model held the correct answer partway through its reasoning but ended with the wrong one. In some of these cases the model was quite confident in the right answer and changed its mind anyway. The researchers propose the term "fragile correctness" for this state, in which the model has locked onto the correct answer but further reasoning pushes it away. Initial interventions using linear probes improved accuracy on affected questions by five to eight percent. The finding highlights a paradox: more reasoning and more compute do not always produce better answers.
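To make the answer-loss measurement concrete, here is a minimal sketch of how such a rate could be computed. It is not the researchers' code: the `Trace` structure, the idea of recording the model's leading answer at a few checkpoints, and the toy data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    """One question's reasoning trace (hypothetical structure, not from the source)."""
    correct: str            # gold answer label, e.g. "B"
    checkpoints: List[str]  # model's leading answer at successive points in its reasoning
    final: str              # answer the model finally commits to


def is_answer_loss(trace: Trace) -> bool:
    """True if the correct answer appeared mid-reasoning but the final answer is wrong."""
    return trace.final != trace.correct and trace.correct in trace.checkpoints


def answer_loss_rate(traces: List[Trace]) -> float:
    """Fraction of all questions that show answer loss (0.0 for an empty list)."""
    if not traces:
        return 0.0
    return sum(is_answer_loss(t) for t in traces) / len(traces)


# Toy data, invented purely for illustration.
traces = [
    Trace("B", ["A", "B", "B"], "B"),  # reached the correct answer and kept it
    Trace("C", ["C", "C", "D"], "D"),  # answer loss: had C, ended on D
    Trace("A", ["D", "D", "D"], "D"),  # never held the correct answer
    Trace("D", ["D", "A", "D"], "D"),  # wavered but recovered
]
print(answer_loss_rate(traces))  # 0.25
```

Under this definition, a question the model never got right does not count as answer loss; only questions where the correct answer was held and then dropped do.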

Source: https://www.lesswrong.com/posts/JbHLxzuhoCS5ZkJse/fragile...
