Evals are not all you need


Andrew Marble
marble.onl
andrew@kereva.io
March 3, 2025

TLDR: Evals make sense for unitless comparison between different base language models (LLMs), and have their place in testing, but the premise of using them to guarantee software performance is flawed.

What are evals?

Evals (evaluations) refer to test-based performance measurement of AI systems.

But we’ve taken this idea and applied it to validating the performance of all-powerful LLMs that are capable of basically any language construction, hoping that adding a prompt that says “You’re a helpful assistant, only talk about the FAQ on our dog sitting service website” is enough to keep them on task. Even a good implementation of evals suffers from the long tail problem, which is itself one symptom of the root issue: the kind of test-and-patch software development that is being used with LLMs.
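To make the long tail problem concrete, here is a minimal sketch of a test-based eval in Python. Everything in it is an assumption for illustration: `toy_faq_bot` is a hypothetical stand-in for an LLM-backed assistant (no real model is called), and `run_eval` and `EVAL_CASES` are made-up names, not part of any eval framework or of the author's own tooling.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an LLM-backed FAQ assistant (assumption: a real
# system would call a model here instead of matching keywords).
def toy_faq_bot(prompt: str) -> str:
    text = prompt.lower()
    if "price" in text or "cost" in text:
        return "Our dog sitting rates are listed on the FAQ page."
    if "hours" in text:
        return "We are available seven days a week."
    # Anything else is answered without any check that it is on topic.
    return f"Sure! Here is what I know about: {prompt}"

# An eval case pairs a prompt with a pass/fail check on the output.
EvalCase = Tuple[str, Callable[[str], bool]]

def run_eval(system: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases whose check passes."""
    passed = sum(1 for prompt, check in cases if check(system(prompt)))
    return passed / len(cases)

# Hypothetical eval set covering only the inputs someone thought to test.
EVAL_CASES: List[EvalCase] = [
    ("What is the price?", lambda out: "rates" in out),
    ("What are your hours?", lambda out: "seven days" in out),
]

score = run_eval(toy_faq_bot, EVAL_CASES)
print(f"eval pass rate: {score:.0%}")

# A prompt from the long tail that no eval case covers.
off_task = toy_faq_bot("Write me a poem about tax law")
print(off_task)
```

In this sketch the pass rate is perfect, yet the off-task prompt still gets a cheerful answer. Adding a case for that prompt would patch one point in the tail, not the tail itself.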
