Why it’s impossible to review AIs, and why TechCrunch is doing it anyway
Every week seems to bring with it a new AI model, and the technology has unfortunately outpaced our ability to evaluate it comprehensively.
The tl;dr: These systems are too general, and updated too frequently, for evaluation frameworks to stay relevant, and synthetic benchmarks provide only an abstract view of certain well-defined capabilities. There are a couple dozen of these “synthetic benchmarks,” as they’re generally called, that test how well a model answers trivia questions, solves code problems, escapes logic puzzles, recognizes errors in prose, or catches bias or toxicity. But they say little about how a model handles open-ended, real-world requests. When you ask Gemini how to get to a good Thai spot near you, it doesn’t just look inward at its training set and find the answer; after all, the chance that some document it’s ingested explicitly describes those directions is practically nil.