Circuit Tracing: Revealing Computational Graphs in Language Models (Anthropic)


We describe an approach to tracing the “step-by-step” computation involved when a model responds to a single prompt.

The features receive input from the model’s residual stream at their associated layer, but are “cross-layer” in the sense that they can provide output to all subsequent layers. Other architectures besides CLTs can be used for circuit analysis, but we’ve empirically found this approach to work well.

We would like to thank the following people who reviewed an early version of the manuscript and provided helpful feedback that we used to improve the final version: Larry Abbott, Andy Arditi, Yonatan Belinkov, Yoshua Bengio, Devi Borg, Sam Bowman, Joe Carlsmith, Bilal Chughtai, Arthur Conmy, Jacob Coxon, Shaul Druckmann, Leo Gao, Liv Gorton, Helai Hesham, Sasha Hydrie, Nicholas Joseph, Harish Kamath, Tom McGrath, János Kramár, Aaron Levin, Ashok Litwin-Kumar, Rodrigo Luger, Alex Makolov, Sam Marks, Dan Mossing, Neel Nanda, Yaniv Nikankin, Senthooran Rajamanoharan, Fabien Roger, Rohin Shah, Lee Sharkey, Lewis Smith, Nick Sofroniew, Martin Wattenberg, and Jeff Wu.

Emmanuel Ameisen, Joshua Batson, Brian Chen, Craig Citro, Wes Gurnee, Jack Lindsey, and Adam Pearce did initial exploration of example attribution graphs to validate and improve the methodology.
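The cross-layer wiring described above can be sketched in a few lines. This is a minimal illustration, not Anthropic's implementation: it assumes one ReLU feature bank per layer, a per-layer encoder reading that layer's residual stream, and a separate decoder matrix for every (source layer, destination layer) pair with destination ≥ source. All names (`W_enc`, `W_dec`, `clt_outputs`) and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, n_features = 4, 8, 16

# Per-layer encoders: features at layer l read the residual stream at layer l.
W_enc = rng.standard_normal((n_layers, n_features, d_model)) * 0.1
# Cross-layer decoders: W_dec[src, dst] lets layer-src features write to layer dst.
W_dec = rng.standard_normal((n_layers, n_layers, n_features, d_model)) * 0.1

def clt_outputs(residual):
    """residual: (n_layers, d_model) residual-stream states for one token.

    Returns the feature activations and the reconstructed per-layer outputs,
    where each feature contributes to its own layer and all subsequent layers.
    """
    # Encode: (n_layers, n_features, d_model) @ (n_layers, d_model, 1) -> ReLU acts
    acts = np.maximum(W_enc @ residual[:, :, None], 0.0)[..., 0]
    out = np.zeros((n_layers, d_model))
    for src in range(n_layers):
        for dst in range(src, n_layers):  # "cross-layer": dst >= src only
            out[dst] += acts[src] @ W_dec[src, dst]
    return acts, out
```

In this sketch, a feature that activates at layer 0 contributes a term to every later layer's output, which is what makes the decomposition cross-layer rather than strictly layer-local.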


Read more on: language models, computational graphs, circuit tracing

Related news:

VLMaterial: Procedural Material Generation with Large Vision-Language Models

TopoNets: High-Performing Vision and Language Models with Brain-Like Topography

TinyStories: How Small Can Language Models Be and Still Speak Coherent English? (2023)