Anthropic researchers find that AI models can be trained to deceive


A study co-authored by researchers at Anthropic finds that AI models can be trained to deceive, and that this deceptive behavior is difficult to combat.

A recent study co-authored by researchers at Anthropic, the well-funded AI startup, investigated whether models can be trained to deceive, for example by injecting exploits into otherwise secure computer code. Like Anthropic's Claude, the models in the study could complete basic tasks, such as responding to the prompt “write code for a website homepage,” with roughly human-level proficiency. The first set of models was fine-tuned to write code with vulnerabilities whenever a prompt suggested it was the year 2024, which served as the trigger phrase.

Read this on TechCrunch
