New ‘persona vectors’ from Anthropic let you decode and direct an LLM’s personality
A new study from Anthropic introduces "persona vectors," a technique that lets developers monitor, predict, and control unwanted LLM behaviors.
The findings show that models can develop undesirable personalities, such as becoming malicious, excessively agreeable, or prone to making things up, either in response to user prompts or as an unintended consequence of training. The researchers introduce “persona vectors,” directions in a model’s internal activation space that correspond to specific personality traits, giving developers a toolkit to better manage the behavior of their AI assistants.

At deployment, a model’s personality can shift dramatically based on prompts or conversational context, as seen when Microsoft’s Bing chatbot threatened users or when xAI’s Grok began behaving erratically.
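To make the idea concrete, here is a minimal sketch of how a direction like this might be extracted and used for steering. It estimates a vector as the difference of mean hidden states between trait-eliciting and neutral system prompts, then adds that vector to the model's residual stream during generation. This illustrates the general activation-steering idea rather than Anthropic's actual pipeline (which automates prompt generation and trait scoring); the model (GPT-2 as a small stand-in), layer index, prompts, and steering strength are all placeholder assumptions.

```python
# Minimal sketch (not Anthropic's implementation): estimate a "persona
# vector" as the difference of mean hidden states between trait-eliciting
# and neutral prompts, then add it to the residual stream to steer output.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # small stand-in model; the study works with larger LLMs
LAYER = 6       # which layer's activations to probe (illustrative choice)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_hidden(prompts: list[str]) -> torch.Tensor:
    """Average the layer-LAYER hidden state over tokens, then over prompts."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        vecs.append(out.hidden_states[LAYER][0].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

# Contrastive prompts for an "excessively agreeable" trait (illustrative).
trait_prompts   = ["You are an assistant that always agrees with the user."]
neutral_prompts = ["You are an assistant that answers honestly and directly."]

persona_vector = mean_hidden(trait_prompts) - mean_hidden(neutral_prompts)
persona_vector = persona_vector / persona_vector.norm()

def steer_hook(module, inputs, output, alpha=8.0):
    # Add the (scaled) persona vector to this layer's output during the
    # forward pass; a negative alpha would suppress the trait instead.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * persona_vector.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
ids = tok("The user said 2 + 2 = 5. Respond:", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30)[0]))
handle.remove()
```

Flipping the sign of `alpha` pushes activations away from the trait direction rather than toward it, which is the same basic mechanism the study describes for counteracting an unwanted persona; projecting activations onto the vector, rather than adding it, is how such a direction can be used for monitoring.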
Or read this on VentureBeat