Show HN: Llama 3.2 Interpretability with Sparse Autoencoders
A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible (PaulPauls/llama3_interpretability_sae).
While this dataset is roughly an order of magnitude smaller than those used by Anthropic or Google DeepMind (both of whom trained on around 8 billion unique activations), it still provides, in my opinion, a substantial foundation for training an initial SAE model. This decision aims to balance several factors: providing sufficient feature capacity at approximately 21x the residual stream dimension, maintaining computational efficiency as suggested by the OpenAI and Google DeepMind papers, and staying within the project's monetary constraints, since capturing ~8 billion activations for direct comparability would have exceeded them. While the current sentence-level analysis provides a good foundation, more nuanced pattern-recognition approaches could reveal subtler semantic features and improve our understanding of how the model represents information.
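To make the feature-capacity figure concrete, below is a minimal sketch of a sparse autoencoder over residual-stream activations with an expansion factor of about 21. The hidden width, the pre-encoder bias, the L1 sparsity penalty, and the choice of d_model = 2048 (the Llama 3.2 1B residual stream width) are illustrative assumptions for a common SAE baseline, not the repository's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: encode residual-stream activations into a
    wider, sparse feature space, then reconstruct them.
    Dimensions and loss weighting are illustrative assumptions."""

    def __init__(self, d_model: int = 2048, expansion: int = 21):
        super().__init__()
        d_features = d_model * expansion  # ~21x the residual stream dimension
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)
        self.pre_bias = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Encode: subtract a learned bias, project up, ReLU keeps features non-negative
        f = F.relu(self.encoder(x - self.pre_bias))
        # Decode: project back down to the residual stream dimension
        x_hat = self.decoder(f) + self.pre_bias
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features
    # to zero; the coefficient here is an illustrative assumption.
    recon = F.mse_loss(x_hat, x)
    sparsity = l1_coeff * f.abs().sum(dim=-1).mean()
    return recon + sparsity

# Usage sketch: a batch of 8 activation vectors from the residual stream
x = torch.randn(8, 2048)
sae = SparseAutoencoder()
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
```

The L1 penalty above is the simplest way to induce sparsity; some of the papers the project cites (e.g., OpenAI's work) instead use a hard top-k activation in place of the penalty, which is a reasonable substitution in this sketch.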