Evals are not all you need

Andrew Marble
marble.onl
andrew@kereva.io
March 3, 2025

TLDR: Evals make sense for unitless comparison between different base language models (LLMs), and have their place in testing, but the premise of using them to guarantee software performance is flawed.

What are evals?

Evals (evaluations) are test-based performance measurements of AI systems.

But we’ve taken this idea and applied it to validating the performance of all-powerful LLMs that are capable of basically any language construction, hoping that adding a prompt that says “You’re a helpful assistant, only talk about the FAQ on our dog sitting service website” is enough to keep them on task. Even a good implementation of evals suffers from the long tail problem, which is itself one symptom of the root issue: the kind of test-and-patch software development that is being used with LLMs.
