The Biology of a Large Language Model
Large language models display impressive capabilities. However, for the most part, the mechanisms by which they do so are unknown.
In our companion paper, Circuit Tracing: Revealing Computational Graphs in Language Models, we build on recent work to introduce a new set of tools for identifying features and mapping connections between them – analogous to neuroscientists producing a “wiring diagram” of the brain. We rely heavily on a tool we call attribution graphs, which allow us to partially trace the chain of intermediate steps that a model uses to transform a specific input prompt into an output response.
Prior studies of multilingual models primarily rely on the logit lens technique and component-level activation patching to show that models have an English-aligned intermediate representation, which they subsequently convert to a language-specific output in the final layers.