Writing an LLM from scratch, part 8 – trainable self-attention
Moving on from a toy self-attention mechanism, it's time to find out how to build a real trainable one. Following Sebastian Raschka's book 'Build a Large Language Model (from Scratch)'. Part 8/??
I suspected earlier that Raschka was using that specific operation for his "toy" self-attention because the real implementation is similar, and that has turned out to be right, as we're doing scaled dot products here. I'm going to loosely follow Raschka's explanation, but using mathematical notation rather than code, as (unusually for me as a career techie) I found it a bit easier to grasp what's going on that way. The names of the matrices used -- query, key and value -- hint metaphorically at the roles they play; Raschka says in a sidebar that it's a nod to information retrieval systems like databases.
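To make the scaled dot-product idea concrete before diving into the maths, here's a minimal PyTorch sketch of trainable self-attention. The input embeddings, dimensions, and variable names here are illustrative rather than lifted from the book's code: three-dimensional token embeddings projected down to two-dimensional queries, keys and values by trainable weight matrices.

```python
import torch

torch.manual_seed(0)

# Six illustrative token embeddings, each of dimension 3 (values are arbitrary).
inputs = torch.rand(6, 3)

d_in, d_out = inputs.shape[1], 2

# Trainable projection matrices for queries, keys and values.
W_query = torch.nn.Parameter(torch.rand(d_in, d_out))
W_key = torch.nn.Parameter(torch.rand(d_in, d_out))
W_value = torch.nn.Parameter(torch.rand(d_in, d_out))

queries = inputs @ W_query  # (6, d_out)
keys = inputs @ W_key       # (6, d_out)
values = inputs @ W_value   # (6, d_out)

# Scaled dot products: every query against every key, divided by sqrt(d_out),
# then softmaxed so each row of attention weights sums to 1.
attn_scores = queries @ keys.T                                  # (6, 6)
attn_weights = torch.softmax(attn_scores / d_out ** 0.5, dim=-1)

# Each output (context vector) is a weighted sum of the value vectors.
context = attn_weights @ values                                 # (6, d_out)
print(context)
```

Because the query, key and value projections are `Parameter`s, gradients flow back through them during training, which is what makes this version of self-attention "trainable" compared with the earlier toy one.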