Get the latest tech news
About AI Evals
FAQ from our course on AI Evals.
Isaac’s Anki flashcard annotation app shows the power of custom tools—handling 400+ results per query with keyboard navigation and domain-specific evaluation criteria that would be nearly impossible to configure in a generic tool. Before diving into the full multi-turn complexity, simplify it to a single turn: “What is the return window for product X1000?” If it still fails, you’ve proven the error isn’t about conversation context - it’s likely a basic retrieval or knowledge issue you can debug more easily. For a recipe app, dimensions might include Dietary Restriction ( vegan, gluten-free, none), Cuisine Type ( Italian, Asian, comfort food), and Query Complexity ( simple request, multi-step, edge case).
Or read this on Hacker News