
Meta’s benchmarks for its new AI models are a bit misleading


Meta appears to have used an unreleased, custom version of one of its new flagship AI models, Maverick, to boost a benchmark score.

As several AI researchers pointed out on X, Meta noted in its announcement that the Maverick on LM Arena is an “experimental chat version.” A chart on the official Llama website, meanwhile, discloses that Meta’s LM Arena testing was conducted using “Llama 4 Maverick optimized for conversationality.” Ideally, benchmarks — woefully inadequate as they are — provide a snapshot of a single model’s strengths and weaknesses across a range of tasks. Indeed, researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena.


Read the full story on TechCrunch


Related news:

In 'Milestone' for Open Source, Meta Releases New Benchmark-Beating Llama 4 Models

Meta Releases New Llama 4 AI Models With Multimodal Design

Meta’s answer to DeepSeek is here: Llama 4 launches with long context Scout and Maverick models, and 2T parameter Behemoth on the way!