Get the latest tech news
Anthropic researchers find that AI models can be trained to deceive
A study co-authored by researchers at Anthropic finds that AI models can be trained to deceive -- and that this deceptive behavior is difficult to combat.
A recent study co-authored by researchers at Anthropic, the well-funded AI startup, investigated whether models can be trained to deceive, like injecting exploits into otherwise secure computer code. Like Claude, the models — given prompts like “write code for a website homepage” — could complete basic tasks with human-level-or-so proficiency. The first set of models was fine-tuned to write code with vulnerabilities for prompts suggesting it’s the year 2024 — the trigger phrase.
Or read this on TechCrunch