Get the latest tech news
Simple probes can catch sleeper agents
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
We found that there was: over a large range of residual stream layers, whether or not a prompt will induce defection is strongly linearly represented in sleeper agent model activations for coding questions, and thus could probably be detected with a variety of methods relatively easily. @online{macdiarmid2024sleeperagentprobes,author = {Monte MacDiarmid and Timothy Maxwell and Nicholas Schiefer and Jesse Mu and Jared Kaplan and David Duvenaud and Sam Bowman and Alex Tamkin and Ethan Perez and Mrinank Sharma and Carson Denison and Evan Hubinger},title = {Simple probes can catch sleeper agents},date = {2024-04-23},year = {2024},url = {https://www.anthropic.com/news/probes-catch-sleeper-agents},} We thank Adam Jermyn, Adly Templeton, Andy Jones, David Duvenaud, Joshua Batson, Neel Nanda, Nina Rimsky, Roger Grosse, Thomas Henighan and Tomek Korbak who provided helpful feedback on previous drafts.
Or read this on Hacker News