Why it’s impossible to review AIs, and why TechCrunch is doing it anyway


Every week seems to bring with it a new AI model, and the technology has unfortunately outpaced our ability to evaluate it comprehensively.

The tl;dr: These systems are too general and are updated too frequently for evaluation frameworks to stay relevant, and synthetic benchmarks provide only an abstract view of certain well-defined capabilities.

What we have are a couple dozen “synthetic benchmarks,” as they’re generally called, that test a model on how well it answers trivia questions, or solves code problems, or escapes logic puzzles, or recognizes errors in prose, or catches bias or toxicity. But those tests capture little of how these models are actually used. When you ask Gemini how to get to a good Thai spot near you, it doesn’t just look inward at its training set and find the answer; after all, the chance that some document it’s ingested explicitly describes those directions is practically nil.
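
To make concrete what these benchmarks actually measure, here is a minimal sketch in Python of the exact-match scoring style many of them use. The ask_model function and the two-item question set are hypothetical stand-ins, not any real suite’s API; real benchmarks run thousands of items with more forgiving scoring.

# Minimal sketch of exact-match benchmark scoring (illustrative only).

BENCHMARK = [
    {"question": "What is the capital of Thailand?", "answer": "Bangkok"},
    {"question": "What is 7 * 8?", "answer": "56"},
]

def ask_model(question: str) -> str:
    # Hypothetical stand-in for a real model API call.
    return "Bangkok" if "Thailand" in question else "56"

def exact_match_accuracy(benchmark) -> float:
    # Fraction of items where the model's answer matches the reference string.
    correct = sum(
        ask_model(item["question"]).strip().lower() == item["answer"].lower()
        for item in benchmark
    )
    return correct / len(benchmark)

print(f"Exact-match accuracy: {exact_match_accuracy(BENCHMARK):.0%}")

A score like this says something well-defined about a narrow capability, which is exactly the article’s point: it is an abstraction, not a review of how the model behaves in open-ended use.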
