AI agent benchmarks are broken


Benchmarks are foundational to evaluating the strengths and limitations of AI systems, guiding both research and industry development.

We applied ABC to ten popular AI agent benchmarks, including SWE-bench Verified, WebArena, OSWorld, and more. Here is a summary of the issues we identified in benchmarks used to evaluate frontier AI agent systems, including Claude Code and OpenAI Operator. Similar to the issues found in SWE-bench Verified, correctness checks based on random-valued tensors may fail to capture bugs in a generated kernel, especially memory- or shape-related ones.
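
To make this failure mode concrete, here is a minimal sketch (an illustration, not code from the study), assuming a hypothetical buggy_rowsum kernel checked against a NumPy reference on random inputs at a single shape:

```python
import numpy as np

def reference_rowsum(x):
    # Ground-truth oracle: sum across each row.
    return x.sum(axis=1)

def buggy_rowsum(x):
    # Hypothetical "generated kernel": processes rows in tiles of 128
    # columns and silently drops the ragged tail, so it is only correct
    # when the row length happens to be a multiple of the tile size.
    tile = 128
    n = (x.shape[1] // tile) * tile  # BUG: the last partial tile is ignored
    return x[:, :n].sum(axis=1)

# Random-valued check at one "convenient" shape: 1024 is a multiple of 128,
# so the dropped tail is empty and the buggy kernel passes.
x = np.random.rand(64, 1024)
print(np.allclose(buggy_rowsum(x), reference_rowsum(x)))  # True -> looks correct

# The same check at a shape the harness never samples exposes the bug.
y = np.random.rand(64, 1000)  # 1000 is not a multiple of 128
print(np.allclose(buggy_rowsum(y), reference_rowsum(y)))  # False -> bug revealed
```

A check like this only exercises the shapes and value distributions the harness happens to sample, which is why shape- and memory-related bugs can slip through even when every random-tensor comparison passes.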
