Anthropic wants to stop AI models from turning evil - here's how

Can a new approach to AI model training prevent systems from absorbing harmful data?

That said, President Donald Trump's recent AI Action Plan did mention the importance of interpretability, or the ability to understand how models make decisions, and persona vectors contribute to that goal. Much as parts of the human brain light up based on a person's moods, Anthropic explained, seeing patterns in a model's neural network when these vectors activate can help researchers catch undesirable behavior ahead of time. "Persona vectors are a promising tool for understanding why AI systems develop and express different behavioral characteristics, and for ensuring they remain aligned with human values," Anthropic noted.

Source: ZDNet