New secret math benchmark stumps AI models and PhDs alike


FrontierMath’s difficult questions remain unpublished so that AI companies can’t train against them.

The benchmark tests AI language models (such as GPT-4o, which powers ChatGPT) against original mathematics problems that typically require hours or days for specialist mathematicians to complete. The designers made the problems "guessproof" by requiring large numerical answers or complex mathematical solutions, leaving less than a 1 percent chance that a random guess is correct. "Because an AI system has vastly greater computational power, it's actually possible to design problems with easily verifiable solutions using the same idea that IOI or Project Euler does—basically, 'write a proof' is replaced by 'implement an algorithm in code,'" Chen explained.
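
To make the "easily verifiable solutions" idea concrete, here is a minimal sketch of Project Euler-style exact-answer checking. The example problem (summing the first 10,000 primes), the grader layout, and the function names are illustrative stand-ins rather than actual FrontierMath content, since the benchmark's problems remain unpublished.

```python
def sum_of_first_n_primes(n: int) -> int:
    """Sum of the first n primes, found by trial division against earlier primes."""
    primes: list[int] = []
    candidate = 2
    while len(primes) < n:
        is_prime = True
        for p in primes:
            if p * p > candidate:
                break  # no divisor up to sqrt(candidate), so it is prime
            if candidate % p == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(candidate)
        candidate += 1
    return sum(primes)


# The exact answer the grader holds; here it is simply recomputed locally.
GROUND_TRUTH = sum_of_first_n_primes(10_000)


def verify(submitted_answer: int) -> bool:
    """Exact-match scoring: full credit for the exact integer, nothing otherwise."""
    return submitted_answer == GROUND_TRUTH


if __name__ == "__main__":
    print(verify(GROUND_TRUTH))      # True: the exact answer
    print(verify(GROUND_TRUTH + 1))  # False: even a near miss scores zero
```

Because the expected answer is one specific large integer, a model that has not actually solved the problem has essentially no chance of scoring by guessing, which is the point of the "guessproof" design.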
