Google Gemini unexpectedly surges to No. 1 over OpenAI, but benchmarks don’t tell the whole story


Google's Gemini-Exp-1114 AI model tops key benchmarks, but experts warn traditional testing methods may no longer accurately measure true AI capabilities or safety, raising concerns about the industry's current evaluation standards.

Testing platform Chatbot Arena reported that the experimental Gemini version demonstrated superior performance across several key categories, including mathematics, creative writing, and visual understanding. However, when researchers controlled for superficial factors such as response formatting and length, Gemini’s performance dropped to fourth place, showing how traditional metrics can inflate perceived capabilities. The race between tech giants to post ever-higher benchmark scores continues, but the real competition may lie in developing entirely new frameworks for evaluating and ensuring AI system safety and reliability.

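The "style control" adjustment the article refers to can be illustrated with a small Bradley-Terry-style model: pairwise battle outcomes are fit with a logistic regression whose features are model indicators plus a style covariate (here, the log-length gap between the two responses), so the remaining coefficients estimate model quality with the length effect factored out. The sketch below is a minimal illustration, not Chatbot Arena's implementation; the model names, battle records, and the single length covariate are all invented for the example, and the platform's published approach uses far more battles and additional style features.

```python
# Minimal sketch of style-controlled pairwise ratings (hypothetical data).
# Each battle contributes a feature row: +1 for the winner's skill term,
# -1 for the loser's, plus the difference in log response length. Fitting
# a logistic regression on win/loss labels recovers skill estimates with
# the length effect absorbed by a separate coefficient.
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["gemini-exp", "gpt-4o", "claude"]        # hypothetical entrants
battles = [                                        # (winner, loser, winner_len, loser_len) -- invented
    ("gemini-exp", "gpt-4o", 900, 400),
    ("gemini-exp", "claude", 850, 500),
    ("gpt-4o", "claude", 450, 480),
    ("claude", "gemini-exp", 500, 950),
]

idx = {m: i for i, m in enumerate(models)}
X, y = [], []
for winner, loser, len_w, len_l in battles:
    row = np.zeros(len(models) + 1)
    row[idx[winner]] += 1.0                        # winner's skill indicator
    row[idx[loser]] -= 1.0                         # loser's skill indicator
    row[-1] = np.log(len_w) - np.log(len_l)        # style covariate: length gap
    X.append(row); y.append(1)
    X.append(-row); y.append(0)                    # mirrored row keeps the fit symmetric

clf = LogisticRegression(fit_intercept=False).fit(np.array(X), y)
skills, length_bias = clf.coef_[0][:-1], clf.coef_[0][-1]
for m in sorted(models, key=lambda m: -skills[idx[m]]):
    print(f"{m}: style-adjusted strength {skills[idx[m]]:+.2f}")
print(f"learned length coefficient: {length_bias:+.2f}")
```

With this setup, a model that wins mostly when its answers are much longer sees that advantage attributed to the length coefficient rather than to its skill estimate, which is the kind of adjustment that moved Gemini down the controlled ranking.
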
Read the full story on VentureBeat

Related news:

OpenAI’s tumultuous early years revealed in emails from Musk, Altman, and others

One Year After Altman’s Ouster, OpenAI Remains Dominant

OpenAI at one point considered acquiring AI chip startup Cerebras