New ‘persona vectors’ from Anthropic let you decode and direct an LLM’s personality


A new study from Anthropic introduces "persona vectors," a technique that lets developers monitor, predict, and control unwanted LLM behaviors.

The findings show that models can develop undesirable personalities (e.g., becoming malicious, excessively agreeable, or prone to making things up), either in response to user prompts or as an unintended consequence of training. The researchers' "persona vectors" are directions in a model's internal activation space that correspond to specific personality traits, giving developers a toolkit to better manage the behavior of their AI assistants. At deployment, a model's personality can shift dramatically based on prompts or conversational context, as seen when Microsoft's Bing chatbot threatened users or xAI's Grok began behaving erratically.
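The core idea, estimating a trait direction as the difference between mean activations on trait-exhibiting and neutral responses, then using that direction to monitor or steer new activations, can be sketched in a few lines. This is a toy illustration, not Anthropic's code: the 3-dimensional activation values and function names below are hypothetical, and real persona vectors are extracted from a transformer's high-dimensional hidden states.

```python
# Toy sketch of the persona-vector idea: a trait direction in activation
# space, estimated by contrasting trait-exhibiting vs. neutral responses.

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def persona_vector(trait_acts, neutral_acts):
    """Difference of mean activations: the direction for the trait."""
    mu_t, mu_n = mean(trait_acts), mean(neutral_acts)
    return [t - n for t, n in zip(mu_t, mu_n)]

def project(activation, direction):
    """Monitoring: how strongly an activation aligns with the trait direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(a * d for a, d in zip(activation, direction)) / norm

def steer(activation, direction, alpha):
    """Control: push the activation along (alpha > 0) or against (alpha < 0) the trait."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Hypothetical 3-d activations for two "trait" and two "neutral" responses.
trait_acts = [[1.0, 0.0, 2.0], [1.2, 0.1, 1.8]]
neutral_acts = [[0.0, 0.0, 1.0], [0.2, 0.1, 0.8]]

v = persona_vector(trait_acts, neutral_acts)  # ≈ [1.0, 0.0, 1.0]
```

A high projection score flags a response drifting toward the trait; subtracting the vector (`alpha < 0`) nudges the activation away from it, which mirrors the monitoring and control uses the study describes.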

Read the full story on VentureBeat.

Read more on: LLM, Anthropic, personality

Related news:

Google, OpenAI, Anthropic get blanket deal to saturate US government with their AI

Show HN: An Open-Source E-Book Reader for Conversational Reading with an LLM

I gave the AI arms and legs then it rejected me