Show HN: Pi Co-pilot – Evaluation of AI apps made easy
How Databricks is using synthetic data to simplify evaluation of AI agents
Full LLM training and evaluation toolkit
An Evaluation of the Remote Viewing Program: Operational Applications (1995) [pdf]
Evaluation of Machine Learning Primitives on a Digital Signal Processor
Anthropic’s Claude 3 causes stir by seeming to realize when it was being tested | The model seemingly demonstrated a type of "metacognition" or self-awareness during an evaluation
An Empirical Study & Evaluation of Modern CAPTCHAs