A cheap specialist judge gets used by agents but fails to reduce alignment audit costs

According to a LessWrong post, a lightweight language model designed to flag hidden flaws in AI agents was tested as a cheaper auditing tool. Agents used it frequently during audits, but it helped only when the expensive auditor was already struggling. More importantly, it did not reduce costs: the cheap judge accounted for less than three percent of spending, while the expensive model accounted for ninety-seven to ninety-nine percent. The takeaway: this shortcut in AI auditing has not yet paid off.

Source: https://www.lesswrong.com/posts/AvyXkzAebPcWLZeju/a-cheap...
