Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety.

See Update on how we train SAEs for full details. We performed a sweep over a narrow range of learning rates (suggested by the scaling laws analysis) and chose the value that gave the lowest loss.

Adly Templeton organized a team-wide code cleanup, which Tom Conerly, Jonathan Marcus, Trenton Bricken, Hoagy Cunningham, Jack Lindsey, Brian Chen, Adam Pearce, Nick Turner, and Callum McDougall all contributed to.

Paper Results

Assessing Feature Interpretability – Nick Turner performed the specificity analysis with support from Jack Lindsey and Adly Templeton and guidance from Adam Jermyn and Chris Olah.
