Crowdsourced AI benchmarks have serious flaws, some experts say
Crowdsourced AI benchmarks like Chatbot Arena, which have become popular among AI labs, have serious flaws, some experts say.
Over the past few years, labs including OpenAI, Google, and Meta have turned to platforms that recruit users to help evaluate upcoming models’ capabilities. Hadgu, one of the experts critical of this approach, argued that evaluation should be more rigorous and decentralized. “Benchmarks should be dynamic rather than static data sets,” Hadgu said, “distributed across multiple independent entities, such as organizations or universities, and tailored specifically to distinct use cases, like education, healthcare, and other fields done by practicing professionals who use these [models] for work.”

Hadgu and Kristine Gloria, who formerly led the Aspen Institute’s Emergent and Intelligent Technologies Initiative, also made the case that model evaluators should be compensated for their work.