Get the latest tech news
Persona vectors: Monitoring and controlling character traits in language models
A paper from Anthropic describing persona vectors and their applications to monitoring and controlling model behavior
Building on prior researchinthefield, we applied a technique to extract the patterns the model uses to represent character traits – like evil, sycophancy (insincere flattery), or propensity to hallucinate (make up false information). For instance, recent work demonstrated a surprising phenomenon called emergent misalignment, where training a model to perform one problematic behavior (such as writing insecure code) can cause it to become generally evil across many contexts. We select subsets from LMSYS-CHAT-1M based on “projection difference,” an estimate of how much a training sample would increase a certain personality trait – high (red), random (green), and low (orange).
Or read this on Hacker News