Writing an LLM from scratch, part 13 – attention heads are dumb


A pause to take stock: realising that attention heads are simpler than I thought explains why we do the calculations the way we do.

If you cast your mind back to part 5, a big problem with encoder/decoder RNNs that did not have attention mechanisms was the fixed-length bottleneck. You would run your input sequence into an encoder RNN, which would try to represent its meaning in its hidden state -- a vector of a particular fixed length -- ready to pass it on to the decoder.

Now, as I said earlier, the real attention heads, having been trained by gradient descent over billions of tokens, will probably have learned something weird and abstract and not related to the way we think of language, grammar and the parts of speech.
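To make that fixed-length bottleneck concrete, here is a minimal PyTorch sketch (not code from this series; the layer sizes, variable names and use of GRUs are illustrative assumptions): however long the input sequence is, the decoder only ever receives the encoder's final hidden state, a single fixed-size vector.

```python
import torch
import torch.nn as nn

# Illustrative sizes -- not taken from the article.
vocab_size, embed_dim, hidden_dim = 1000, 64, 128

embedding = nn.Embedding(vocab_size, embed_dim)
encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)

# Two inputs of very different lengths...
short_input = torch.randint(0, vocab_size, (1, 5))    # 5 tokens
long_input = torch.randint(0, vocab_size, (1, 500))   # 500 tokens

_, short_state = encoder(embedding(short_input))
_, long_state = encoder(embedding(long_input))

# ...both get squeezed into the same fixed-size vector of shape
# (num_layers, batch, hidden_dim) = (1, 1, 128). That one vector is all
# the decoder gets as its initial hidden state -- the bottleneck.
print(short_state.shape, long_state.shape)  # torch.Size([1, 1, 128]) twice

decoder_start = torch.randint(0, vocab_size, (1, 1))  # e.g. a start-of-sequence token
decoder_out, _ = decoder(embedding(decoder_start), long_state)
```

Attention addresses this by letting the decoder look back over all of the encoder's per-token outputs instead of relying on that single vector.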
