How Databricks is using synthetic data to simplify evaluation of AI agents
Full LLM training and evaluation toolkit
An Evaluation of the Remote Viewing Program: Operational Applications (1995) [pdf]
Evaluation of Machine Learning Primitives on a Digital Signal Processor
Anthropic’s Claude 3 causes stir by seeming to realize when it was being tested | The model seemingly demonstrated a type of "metacognition" or self-awareness during an evaluation
An Empirical Study & Evaluation of Modern CAPTCHAs