Persona vectors: Monitoring and controlling character traits in language models


A paper from Anthropic describing persona vectors and their applications to monitoring and controlling model behavior

Building on prior research in the field, we applied a technique to extract the patterns the model uses to represent character traits, such as evil, sycophancy (insincere flattery), or a propensity to hallucinate (make up false information). Recent work demonstrated a surprising phenomenon called emergent misalignment, where training a model to perform one problematic behavior (such as writing insecure code) can cause it to become generally evil across many contexts. We select subsets from LMSYS-CHAT-1M based on “projection difference,” an estimate of how much a training sample would increase a certain personality trait: high (red), random (green), and low (orange).
