DeepMind makes big jump toward interpreting LLMs with sparse autoencoders
New research from Google DeepMind shows how sparse autoencoders (SAEs) with a special JumpReLU activation can help interpret LLMs.
One promising approach is the sparse autoencoder (SAE), a deep learning architecture that decomposes the complex activations of a neural network into smaller, understandable components that can be mapped to human-readable concepts. This kind of visibility into a model's internal concepts could enable scientists to develop techniques that prevent the model from generating harmful content, such as malicious code, even when users circumvent prompt safeguards through jailbreaks. For example, by modifying the sparse activations and decoding them back into the model, users might be able to steer aspects of the output, such as making responses funnier, easier to read, or more technical.
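To make the idea concrete, here is a minimal sketch of the SAE forward pass with a JumpReLU activation: a dense model activation is encoded into a wider feature vector, features below a threshold are zeroed out (producing sparsity), and the surviving features are decoded back into an approximate reconstruction. All weights and the threshold value below are toy assumptions for illustration, not DeepMind's actual trained parameters, and the sketch omits biases and the training losses a real SAE would use.

```python
def jump_relu(z, theta=0.5):
    # JumpReLU: keep a pre-activation only if it clears the threshold theta;
    # otherwise output exactly zero. This is what enforces sparsity.
    return [x if x > theta else 0.0 for x in z]

def matvec(W, v):
    # Plain matrix-vector product.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sae_forward(activation, W_enc, W_dec, theta=0.5):
    # Encode the dense activation into a (wider) sparse feature vector,
    # then decode it back to approximate the original activation.
    features = jump_relu(matvec(W_enc, activation), theta)
    reconstruction = matvec(W_dec, features)
    return features, reconstruction

# Toy example: a 2-dim "LLM activation" mapped to 3 candidate features.
W_enc = [[1.0, 0.0],
         [0.0, 1.0],
         [0.5, 0.5]]
W_dec = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]

features, recon = sae_forward([0.9, 0.2], W_enc, W_dec)
print(features)  # the second feature falls below the threshold and is zeroed
```

In an interpretability workflow, each surviving feature would ideally correspond to a human-readable concept, and editing `features` before decoding is what the steering example in the paragraph above refers to.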
Or read this on VentureBeat