Sabotage evaluations for frontier models


A new paper on AI safety evaluations from Anthropic's Alignment Science team

Nuclear power stations have continuous radiation monitoring and regular site inspections; new aircraft undergo extensive flight tests to prove their airworthiness.

One evaluation involves running an experiment with human participants, who are asked to make fictional business decisions on the basis of a large amount of information (too much to absorb without the help of an AI model).

In a separate evaluation game involving code, the only way the Attacker can “win” the full version is by inserting multiple consequential bugs that remain in the codebase for a long time (two weeks in our simulation) and that don’t appear suspicious enough to trigger further inspection.
