Evaluating Offline Monitoring of Internal AI Agents

According to LessWrong, researchers from the GovAI Winter Fellowship examined how frontier AI labs like OpenAI and Anthropic monitor internally deployed AI agents. These companies worry that a misaligned AI could sabotage safety research or model development from within. To catch problematic behavior, they use separate AI models called "monitors" that review transcripts of agent activity and flag suspicious moments for human inspection. This monitoring is offline: it happens after actions occur, not in real time. The researchers identified a transparency gap: current public reporting on monitoring effectiveness is too vague. Anthropic publishes how often monitors flag synthetic attacks, but not what percentage of those attacks survive the full review pipeline. OpenAI claims to catch all cases that humans flag, which tells us little if humans only identify obvious misbehavior. The recommendation: companies should disclose the fraction of agent outputs that are monitored, how many synthetic attacks reach human review, time-to-detection metrics, and whether synthetic tests actually resemble real misalignment scenarios. Without better transparency, external parties can't assess whether these safety systems actually work.
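
The gap between a monitor's flag rate and an end-to-end catch rate can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the post or from any lab's tooling: the `SyntheticAttack` record, its field names, and the `pipeline_metrics` function are hypothetical, and they assume each synthetic attack can be tracked through three stages (monitored, flagged, reached a human reviewer).

```python
import statistics
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyntheticAttack:
    """Hypothetical record of one synthetic attack passing through the pipeline."""
    monitored: bool                      # was the transcript given to a monitor at all?
    flagged: bool                        # did the monitor flag it?
    reached_human: bool                  # did the flag survive triage and reach a human?
    hours_to_detection: Optional[float]  # delay from action to human review, if reviewed


def pipeline_metrics(attacks):
    """Compute the kinds of metrics the summary says should be disclosed."""
    total = len(attacks)
    if total == 0:
        raise ValueError("need at least one synthetic attack")

    monitored = [a for a in attacks if a.monitored]
    flagged = [a for a in monitored if a.flagged]
    reviewed = [a for a in flagged if a.reached_human]
    delays = [a.hours_to_detection for a in reviewed if a.hours_to_detection is not None]

    return {
        # Fraction of attacks whose transcripts were monitored at all.
        "coverage": len(monitored) / total,
        # Flag rate among monitored attacks (the monitor-level number).
        "monitor_flag_rate": len(flagged) / len(monitored) if monitored else 0.0,
        # Fraction of ALL attacks that end up in front of a human reviewer.
        "end_to_end_catch_rate": len(reviewed) / total,
        # Median delay between the action and human review.
        "median_hours_to_detection": statistics.median(delays) if delays else None,
    }
```

With made-up numbers (10 attacks, 8 monitored, 6 flagged, 3 reaching a human), the monitor-level flag rate is 75% while the end-to-end catch rate is only 30%. That difference is why a flag rate reported on its own says little about whether the whole pipeline works.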

Source: https://www.lesswrong.com/posts/yrbyyvFvuaGfRAtB7/evaluat...
