Crowdsourced AI benchmarks have serious flaws, some experts say


Crowdsourced AI benchmarks like Chatbot Arena, which have become popular among AI labs, have serious flaws, some experts say.

Over the past few years, labs including OpenAI, Google, and Meta have turned to platforms that recruit users to help evaluate upcoming models’ capabilities. Critics of the practice argue that such benchmarks need to change. “Benchmarks should be dynamic rather than static data sets,” said Asmelash Teka Hadgu, co-founder of the AI firm Lesan, “distributed across multiple independent entities, such as organizations or universities, and tailored specifically to distinct use cases, like education, healthcare, and other fields done by practicing professionals who use these [models] for work.” Hadgu and Kristine Gloria, who formerly led the Aspen Institute’s Emergent and Intelligent Technologies Initiative, also made the case that model evaluators should be compensated for their work.
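For context, leaderboards such as Chatbot Arena are generally described as ranking models from pairwise human votes using Elo- or Bradley-Terry-style ratings. The sketch below is a simplified, hypothetical illustration of that idea, with invented model names and votes; it is not the platform’s actual methodology.

```python
import math
from collections import defaultdict

# Hypothetical crowdsourced votes: each record says which of two
# anonymized models a user preferred for the same prompt.
votes = [
    ("model_a", "model_b", "model_a"),  # (left, right, winner)
    ("model_a", "model_c", "model_c"),
    ("model_b", "model_c", "model_b"),
    ("model_a", "model_b", "model_a"),
]

def elo_ratings(votes, k=32.0, base=1000.0):
    """Compute simple online Elo ratings from pairwise preference votes."""
    ratings = defaultdict(lambda: base)
    for left, right, winner in votes:
        # Expected score of `left` against `right` under the Elo model.
        expected_left = 1.0 / (1.0 + 10 ** ((ratings[right] - ratings[left]) / 400.0))
        score_left = 1.0 if winner == left else 0.0
        # Nudge both ratings toward the observed outcome.
        ratings[left] += k * (score_left - expected_left)
        ratings[right] += k * ((1.0 - score_left) - (1.0 - expected_left))
    return dict(ratings)

if __name__ == "__main__":
    for model, rating in sorted(elo_ratings(votes).items(), key=lambda x: -x[1]):
        print(f"{model}: {rating:.1f}")
```

The sketch also hints at the critics’ point: a single aggregate rating built from whatever prompts volunteers happen to submit says little about performance in specific domains such as education or healthcare.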



Related news:

UAE set to use AI to write laws in world first | Gulf state expects move will speed up lawmaking by 70% but experts warn of ‘reliability’ issues with artificial intelligence

Sparsely-Gated Mixture of Experts (MoE)

Google’s latest AI model report lacks key safety details, experts say