DeepMind makes big jump toward interpreting LLMs with sparse autoencoders


New research from Google DeepMind shows how sparse autoencoders (SAEs) with a special JumpReLU activation function can help interpret LLMs.

One promising approach is the sparse autoencoder (SAE), a deep learning architecture that breaks down the complex activations of a neural network into smaller, understandable components that can be associated with human-readable concepts. This kind of visibility into a model's internal concepts could let scientists develop techniques that prevent it from generating harmful content, such as malicious code, even when users circumvent prompt safeguards through jailbreaks. It could also offer finer control over outputs: for example, by changing the sparse activations and decoding them back into the model, users might be able to steer aspects of a response, such as making it funnier, easier to read, or more technical.
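To make the idea concrete, here is a minimal sketch of a JumpReLU sparse autoencoder in PyTorch. It only illustrates the general mechanism described above (an encoder producing sparse feature activations, a learnable per-feature threshold, and a decoder that reconstructs the original activations); the layer sizes, initialization, and the simple sparsity count are illustrative assumptions, not DeepMind's exact training recipe.

```python
# Illustrative sketch of a JumpReLU sparse autoencoder (not DeepMind's code).
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_features) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(d_features))
        self.W_dec = nn.Parameter(torch.randn(d_features, d_model) * 0.02)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # Learnable per-feature threshold: pre-activations below it are zeroed.
        self.log_threshold = nn.Parameter(torch.zeros(d_features))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = x @ self.W_enc + self.b_enc
        threshold = self.log_threshold.exp()
        # JumpReLU: pass a value through unchanged only if it exceeds the threshold.
        return pre * (pre > threshold)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor):
        f = self.encode(x)       # sparse, potentially interpretable features
        x_hat = self.decode(f)   # reconstruction of the original activation
        return x_hat, f

# Toy usage with random stand-ins for LLM activations (sizes are assumptions).
sae = JumpReLUSAE(d_model=768, d_features=16384)
x = torch.randn(32, 768)
x_hat, features = sae(x)
recon_loss = (x - x_hat).pow(2).sum(-1).mean()
avg_active = (features > 0).float().sum(-1).mean()  # average number of active features
```

In this kind of setup, the few features that fire for a given input are the "components" the article refers to; editing them before calling `decode` is one way a researcher might try to steer the model's behavior.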

Or read this on VentureBeat

Read more on:

DeepMind

LLMs

Related news:

Google's DeepMind Says Its AI Can Tackle Math Olympiad Problems

Show HN: Briefer – Multiplayer notebooks with schedules, SQL, and built-in LLMs

Show HN: Convert HTML DOM to semantic markdown for use in LLMs