Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability


1 Introduction

The logit lens (nostalgebraist 2020) is a simple yet powerful tool for understanding how transformer models (Vaswani et al. 2017; Brown et al.

() examines GPT-2 circuits for the Indirect Object Identification task, using the last hidden state norm to determine each layer’s contribution to the output, similar to our residual prism.

Mirzadeh, Iman, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, and Mehrdad Farajtabar.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin.
