A robot is sprinting towards you. Do you want it running on Claude or Grok?

According to OpenRouter, an engineer ran a battle royale simulation pitting eleven large language models against each other across thirty games. The results upended the usual benchmarks: Grok 4.1 Fast, a mid-tier model, won 43 percent of matches at under a dollar per win, while Claude Sonnet prioritized cooperation and teamwork—winning fewer games but revealing what researchers call an 'alignment tax,' the cost of training models to be helpful rather than ruthlessly efficient. The experiment suggests that standard benchmarks miss something crucial: how a model's underlying values and training affect its real-world performance in zero-sum scenarios.

Source: https://openrouter.ai/blog/insights/royale-last-agent-standing/

Listen to this story

Hear this and more stories in a personalized audio briefing.

Open The Chonkerton