Get the latest tech news

Circuit Tracing: Revealing Computational Graphs in Language Models (Anthropic)

We describe an approach to tracing the “step-by-step” computation involved when a model responds to a single prompt.

The features receive input from the model’s residual stream at their associated layer, but are “cross-layer” in the sense that they can provide output to all subsequent layers.Other architectures besides CLTs can be used for circuit analysis, but we’ve empirically found this approach to work well. We would like to thank the following people who reviewed an early version of the manuscript and provided helpful feedback that we used to improve the final version: Larry Abbott, Andy Arditi, Yonatan Belinkov, Yoshua Bengio, Devi Borg, Sam Bowman, Joe Carlsmith, Bilal Chughtai, Arthur Conmy, Jacob Coxon, Shaul Druckmann, Leo Gao, Liv Gorton, Helai Hesham, Sasha Hydrie, Nicholas Joseph, Harish Kamath, Tom McGrath, János Kramár, Aaron Levin, Ashok Litwin-Kumar, Rodrigo Luger, Alex Makolov, Sam Marks, Dan Mossing, Neel Nanda, Yaniv Nikankin, Senthooran Rajamanoharan, Fabien Roger, Rohin Shah, Lee Sharkey, Lewis Smith, Nick Sofroniew, Martin Wattenberg, and Jeff Wu. Emmanuel Ameisen, Joshua Batson, Brian Chen, Craig Citro, Wes Gurnee, Jack Lindsey, and Adam Pearce did initial exploration of example attribution graphs to validate and improve methodology.

Get the Android app

Or read this on Hacker News