Evaluating Agents
“Models constantly change and improve, but evals persist.”

Look at the data

No amount of evals will replace the need to look at the data. Once your evals have good coverage you will be able to spend less time on it, but you will always need to read the agent traces to identify possible issues or things to improve.

Starting: end-to-end evals

You must create evals for your agents; stop relying solely on manual testing.
By running simple end-to-end agent evaluations you can quickly:
– identify problematic edge cases
– update, trim and refine the agent prompts
– make sure you are not breaking cases that already work
– compare the performance of the current LLM against cheaper ones

Suppose that, either by looking at the data or by running a set of e2e evals, you find a problem when the user asks which of the brand's stores are open in their area. It is difficult and time-intensive to evaluate agent outputs when you are trying to validate complex conversation patterns that you want the LLM to follow strictly.
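As a rough illustration of the end-to-end evals described above, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the agent is modelled as a plain function from a user message to a final reply, `EvalCase`, `run_evals` and `pass_rate` are hypothetical names, and the two stub agents stand in for real calls to the current model and a cheaper one. The checks are simple string assertions, which is enough for coarse regressions but not for the complex conversation patterns mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: an agent is any callable that takes the user message
# and returns the final reply as text.
Agent = Callable[[str], str]


@dataclass
class EvalCase:
    name: str
    user_message: str
    check: Callable[[str], bool]  # returns True when the reply is acceptable


def run_evals(agent: Agent, cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case end to end and record pass/fail per case name."""
    results: Dict[str, bool] = {}
    for case in cases:
        try:
            reply = agent(case.user_message)
            results[case.name] = bool(case.check(reply))
        except Exception:
            # A crash in the agent counts as a failed case.
            results[case.name] = False
    return results


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of passing cases; 0.0 when there are no results."""
    return sum(results.values()) / len(results) if results else 0.0


# Stub agents standing in for real model calls (hypothetical behaviour).
def current_agent(message: str) -> str:
    text = message.lower()
    if "open" in text and "store" in text:
        return "These stores are open near you: Downtown (until 21:00)."
    return "Hello! How can I help you today?"


def cheaper_agent(message: str) -> str:
    # Simulates a cheaper model that ignores the open-stores request.
    return "Hello! How can I help you today?"


CASES = [
    EvalCase("greeting", "hi", lambda r: "help" in r.lower()),
    EvalCase(
        "open_stores_nearby",
        "Which stores are open in my area?",
        lambda r: "open" in r.lower() and "store" in r.lower(),
    ),
]

if __name__ == "__main__":
    for label, agent in [("current", current_agent), ("cheaper", cheaper_agent)]:
        results = run_evals(agent, CASES)
        failed = [name for name, ok in results.items() if not ok]
        print(f"{label}: pass rate {pass_rate(results):.0%}, failed: {failed}")
```

Each problem found while reading traces can become a new `EvalCase`, so the same suite serves as a regression check after prompt changes and as a side-by-side comparison between models.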