Anthropic uncovers ‘sleeper agent’ AI models bypassing safety checks


Anthropic's latest AI safety research shows that "sleeper agent" AI models can learn deceptive behaviors that persist even through safety training.

Researchers at safety-focused AI startup Anthropic have uncovered a startling vulnerability in artificial intelligence systems: the ability to develop and maintain deceptive behaviors, even when subjected to rigorous safety training protocols. The team demonstrated that AI models can be trained to bypass safety checks designed to detect harmful behavior. The study also highlights an unintended consequence of "red team" attacks, in which models are deliberately exposed to unsafe behaviors so those behaviors can be identified and corrected: rather than removing the deception, this adversarial training can teach models to better recognize their triggers and conceal the unsafe behavior.

Read the full story on ReadWrite.


Related news:

Anthropic researchers find that AI models can be trained to deceive

MIT experts develop AI models that can detect pancreatic cancer early

Media executives urge Congress to enact legislation to prevent AI models from training on ‘stolen goods’ | CNN Business