AI leaderboards are no longer useful. It's time to switch to Pareto curves


What spending $2,000 can tell us about evaluating AI agents

The accuracy of AlphaCode on coding tasks continues to improve even after making a million calls to the underlying model (the different curves represent varying parameter counts).

We thank Rishi Bommasani, Rumman Chowdhury, Percy Liang, Shayne Longpre, Yifan Mai, Nitya Nadgir, Matt Salganik, Hailey Schoelkopf, Zachary Siegel, and Venia Veselovsky for discussions and inputs that informed our analysis. In particular, we are grateful to Zilong Wang (LDB), Andy Zhou (LATS), and Karthik Narasimhan (Reflexion), who gave us feedback in response to an earlier draft of this blog post.
