Meta’s surprise Llama 4 drop exposes the gap between AI ambition and reality

Touted 10M token context proves elusive, while early performance tests disappoint experts.

Meta claims that Maverick, the larger of its two new Llama 4 models, outperforms competitors like OpenAI's GPT-4o and Google's Gemini 2.0 on various technical benchmarks, which, as we usually note, are not necessarily useful reflections of everyday user experience. Even that claim comes with a catch: independent AI researcher Simon Willison noted a distinction pointed out in Meta's own announcement. The high-ranking leaderboard entry referred to an "experimental chat version scoring ELO of 1417 on LMArena," which is not the same Maverick model made available for download. Some Reddit users also found that Maverick compared unfavorably with competitors such as DeepSeek and Qwen, particularly highlighting its underwhelming performance on coding tasks and software development benchmarks.

Related news:

Meta exec denies the company artificially boosted Llama 4’s benchmark scores

Meta ends its fact-checking program, replacing it with Community Notes | Making users the arbiters of truth in Facebook and Instagram

Meta debuts its first 'mixture of experts' models from the Llama 4 herd