A 20-Year-Old Algorithm Can Help Us Understand Transformer Embeddings
Suppose we ask an LLM: “Can you tell me about Java?” Which “Java” is the model thinking about: the programming language or the Indonesian island? To answer this, we need to look inside the model. Specifically, we want to represent its internal states in a human-interpretable way by finding the concepts the model is thinking about.
In our recent paper, we show that, with minor modifications, a traditional dictionary-learning algorithm (KSVD) can be scaled to datasets with millions of samples and thousands of dimensions, and that its performance matches that of sparse autoencoders (SAEs) on a variety of benchmarks. Based on extrapolated timing results, a naïve implementation of KSVD would take over 30 days to produce a dictionary large enough to interpret LLM embeddings; the modifications behind our variant, DB-KSVD, make this tractable. Further, the fact that two completely different optimization approaches (DB-KSVD and SAEs) achieve similar performance may indicate that we are close to the theoretical limits given the problem setup.
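For readers unfamiliar with the algorithm, here is a minimal NumPy sketch of classical KSVD: alternate between sparse-coding each sample against the current dictionary (here with a simple orthogonal matching pursuit) and updating one dictionary atom at a time via a rank-1 SVD of the residual. This is the textbook algorithm only, not the paper's DB-KSVD; the batching and other modifications that make it scale are described in the paper, and all names here are illustrative.

```python
import numpy as np

def omp(D, y, k):
    """Greedy orthogonal matching pursuit: pick k atoms to approximate y."""
    residual = y.copy()
    support = []
    for _ in range(k):
        # choose the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        support.append(j)
        # re-fit coefficients on the selected atoms, then update the residual
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

def ksvd(Y, n_atoms, sparsity, n_iter=10, seed=0):
    """Toy KSVD on columns of Y (d x n): returns dictionary D and sparse codes X."""
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    D = rng.standard_normal((d, n_atoms))
    D /= np.linalg.norm(D, axis=0)  # unit-norm atoms
    for _ in range(n_iter):
        # sparse coding step: code every sample with at most `sparsity` atoms
        X = np.stack([omp(D, Y[:, i], sparsity) for i in range(n)], axis=1)
        # dictionary update step: refit each atom via a rank-1 SVD
        for j in range(n_atoms):
            used = np.nonzero(X[j])[0]
            if used.size == 0:
                continue
            # residual of the samples that use atom j, with atom j's contribution removed
            E = Y[:, used] - D @ X[:, used] + np.outer(D[:, j], X[j, used])
            U, S, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, j] = U[:, 0]
            X[j, used] = S[0] * Vt[0]
    return D, X

# tiny demo on synthetic "embeddings"; real LLM embeddings are far larger
Y = np.random.default_rng(1).standard_normal((64, 200))
D, X = ksvd(Y, n_atoms=128, sparsity=5, n_iter=5)
```

Each column of X then expresses one embedding as a sparse combination of a few dictionary atoms, which is what makes the decomposition a candidate for human interpretation. The per-sample OMP loop is also what makes the naïve version so slow at scale.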