
Writing an LLM from scratch, part 8 – trainable self-attention


Moving on from a toy self-attention mechanism, it's time to find out how to build a real trainable one. Following Sebastian Raschka's book 'Build a Large Language Model (from Scratch)'. Part 8/??

I suspected earlier that the reason Raschka used that specific operation for his "toy" self-attention was that the real implementation is similar, and that has turned out to be right: we're doing scaled dot products here too. I'm going to loosely follow Raschka's explanation, but using mathematical notation rather than code, as (unusually for me as a career techie) I found it a bit easier to grasp what's going on that way. The names of the matrices involved -- query, key and value -- hint metaphorically at the roles they play; Raschka says in a sidebar that it's a nod to information retrieval systems like databases.
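Concretely, the core operation is scaled dot-product attention, softmax(QKᵀ/√d_k)·V, where the queries Q, keys K and values V are the input embeddings projected through three trainable weight matrices. Here's a minimal PyTorch sketch of that idea for a single sequence of token embeddings; the class name, the dimensions, and the use of nn.Linear without bias are my own illustrative choices, not a quote from the book:

```python
import torch
import torch.nn as nn


class SimpleSelfAttention(nn.Module):
    """A minimal trainable self-attention sketch (illustrative, not Raschka's exact code)."""

    def __init__(self, d_in, d_out):
        super().__init__()
        # Three trainable projection matrices: query, key and value.
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        # x: (num_tokens, d_in) -- one sequence of token embeddings
        queries = self.W_query(x)   # (num_tokens, d_out)
        keys = self.W_key(x)        # (num_tokens, d_out)
        values = self.W_value(x)    # (num_tokens, d_out)

        # Attention scores: dot product of every query with every key,
        # scaled by sqrt(d_out) so the softmax doesn't saturate.
        scores = queries @ keys.T                                   # (num_tokens, num_tokens)
        weights = torch.softmax(scores / keys.shape[-1] ** 0.5, dim=-1)

        # Context vectors: attention-weighted sum of the values.
        return weights @ values                                     # (num_tokens, d_out)


if __name__ == "__main__":
    torch.manual_seed(123)
    x = torch.rand(6, 3)                       # six tokens, embedding dimension 3
    sa = SimpleSelfAttention(d_in=3, d_out=2)
    print(sa(x).shape)                         # torch.Size([6, 2])
```

The only difference from the earlier "toy" version is that the queries, keys and values are now learned projections of the inputs rather than the inputs themselves, which is what makes the mechanism trainable.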
