Writing an LLM from scratch, part 13 – attention heads are dumb
A pause to take stock: realising that attention heads are simpler than I thought explained why we do the calculations we do.
If you cast your mind back to part 5, a big problem with encoder/decoder RNNs that did not have attention mechanisms was the fixed-length bottleneck. You would run your input sequence through an encoder RNN, which would try to represent its meaning in its hidden state -- a vector of a particular fixed length -- ready to pass it on to the decoder. However long the input was, the decoder only ever received that one fixed-size vector (the sketch below makes this concrete).

Now, as I said earlier, the real attention heads, having been trained by gradient descent over billions of tokens, will probably have learned something weird and abstract, not closely related to the way we think about language, grammar and parts of speech.
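To make that fixed-length bottleneck concrete, here is a minimal PyTorch sketch -- my own illustration, not code from the series. The EncoderRNN class, the choice of a GRU cell, and all of the sizes are assumptions picked purely for demonstration; the point is just that the hidden state the encoder hands on has the same fixed shape whether the input is 5 tokens or 500.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not values from the series).
VOCAB_SIZE = 10_000   # hypothetical vocabulary size
EMBED_DIM = 256       # hypothetical embedding size
HIDDEN_DIM = 512      # the fixed length of the bottleneck vector


class EncoderRNN(nn.Module):
    """Encoder that squeezes a whole input sequence into one hidden-state vector."""

    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.rnn = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, EMBED_DIM)
        _, hidden = self.rnn(embedded)         # hidden: (1, batch, HIDDEN_DIM)
        return hidden                          # the entire sequence, compressed into HIDDEN_DIM numbers


encoder = EncoderRNN()
short_input = torch.randint(0, VOCAB_SIZE, (1, 5))    # a 5-token sequence
long_input = torch.randint(0, VOCAB_SIZE, (1, 500))   # a 500-token sequence

print(encoder(short_input).shape)  # torch.Size([1, 1, 512])
print(encoder(long_input).shape)   # torch.Size([1, 1, 512]) -- same fixed-size bottleneck
```

Whatever nuance the 500-token sequence carries has to survive being squeezed into those 512 numbers, which is exactly the pressure that attention mechanisms were introduced to relieve.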