
Meta’s benchmarks for its new AI models are a bit misleading


Meta appears to have used an unreleased, custom version of one of its new flagship AI models, Maverick, to boost a benchmark score.

As several AI researchers pointed out on X, Meta noted in its announcement that the Maverick on LM Arena is an “experimental chat version.” A chart on the official Llama website, meanwhile, discloses that Meta’s LM Arena testing was conducted using “Llama 4 Maverick optimized for conversationality.” Ideally, benchmarks — woefully inadequate as they are — provide a snapshot of a single model’s strengths and weaknesses across a range of tasks. Indeed, researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena.


Read the full story on TechCrunch


Related news:

In 'Milestone' for Open Source, Meta Releases New Benchmark-Beating Llama 4 Models

Meta Releases New Llama 4 AI Models With Multimodal Design

Meta’s answer to DeepSeek is here: Llama 4 launches with long context Scout and Maverick models, and 2T parameter Behemoth on the way!