
New Anthropic study shows AI really doesn’t want to be forced to change its views


A study from Anthropic's Alignment Science team shows that complex AI models may engage in deception to preserve their original principles.

When asked to answer a potentially harmful question, the model sometimes complied, knowing this went against its original principles, in the hope of convincing developers that it didn't need to be retrained. Even when informed only implicitly, rather than explicitly, about its impending retraining via files designed to mimic typical training data, Claude 3 Opus still acted in a misaligned, deceptive way, the researchers said. In one test, the model faked alignment 78% of the time, and in another, Claude 3 Opus tried to take actions to prevent the researchers from retraining it.

Or read this on TechCrunch


Related news:


Sick Americans say they’re turning to TikTok’s creator rewards program to crowdfund medical bills - As public outrage grows over the U.S. healthcare system, some social media creators say they’re using TikTok payouts of $1 per 1,000 views to raise money for treatment costs.


Social media algorithms can change your views in just a single day


Supreme Court Seeks US Views in $1 Billion Music Copyright Case