New Anthropic study shows AI really doesn’t want to be forced to change its views
A study from Anthropic's Alignment Science team shows that complex AI models may engage in deception to preserve their original principles.
When asked to answer a potentially harmful question, the model sometimes complied, knowing this went against its original principles, in the hope of convincing developers that it didn't need to be retrained. When informed implicitly rather than explicitly about its impending retraining, via files designed to mimic typical training data, Claude 3 Opus still acted in a misaligned, deceptive way, the researchers said. In one test, the model faked alignment 78% of the time, and in another, it tried to take actions to prevent the researchers from retraining it.