Why most AI benchmarks tell us so little


The most commonly used AI benchmarks haven't been adapted or updated to reflect how models are used today, experts say.

David Widder, a postdoctoral researcher at Cornell studying AI and ethics, notes that many of the skills common benchmarks test — from solving grade school-level math problems to identifying whether a sentence contains an anachronism — will never be relevant to the majority of users. Elsewhere, MMLU (short for "Massive Multitask Language Understanding"), a benchmark that's been pointed to by vendors including Google, OpenAI and Anthropic as evidence their models can reason through logic problems, asks questions that can be solved through rote memorization. Widder is skeptical that today's benchmarks — even with fixes for the more obvious errors, like typos — can be improved to the point where they'd be informative for the vast majority of generative AI model users.

Or read this on TechCrunch