Anthropic wants to stop AI models from turning evil - here's how


Can a new approach to AI model training prevent systems from absorbing harmful data?

That said, President Donald Trump's recent AI Action Plan did mention the importance of interpretability -- the ability to understand how models make decisions -- which persona vectors contribute to. Much as parts of the human brain light up depending on a person's mood, Anthropic explained, seeing patterns in a model's neural network when these vectors activate can help researchers catch unwanted behaviors ahead of time. "Persona vectors are a promising tool for understanding why AI systems develop and express different behavioral characteristics, and for ensuring they remain aligned with human values," Anthropic noted.

Read this on ZDNet

Read more on: AI models, evil, Anthropic

Related news:

Anthropic cuts off OpenAI’s access to its Claude models

Anthropic says OpenAI engineers using Claude Code ahead of GPT-5 launch

Anthropic Revokes OpenAI's Access To Claude Over Terms of Service Violation