Get the latest tech news

Persona vectors: Monitoring and controlling character traits in language models

A paper from Anthropic describing persona vectors and their applications to monitoring and controlling model behavior

Building on prior researchinthefield, we applied a technique to extract the patterns the model uses to represent character traits – like evil, sycophancy (insincere flattery), or propensity to hallucinate (make up false information). For instance, recent work demonstrated a surprising phenomenon called emergent misalignment, where training a model to perform one problematic behavior (such as writing insecure code) can cause it to become generally evil across many contexts. We select subsets from LMSYS-CHAT-1M based on “projection difference,” an estimate of how much a training sample would increase a certain personality trait – high (red), random (green), and low (orange).

Get the Android app

Or read this on Hacker News