We reverse-engineered Flash Attention 4


Asynchrony, fast approximate exponents, and 10x more efficient softmax.

We’ve recently been contributing to open source LLM inference engines, so we read the code and reverse-engineered how the Flash Attention 4 kernel works, including two math tricks that are classic Dao: faster approximate exponentials and a more efficient online softmax. Generic sketches of both appear below.

When Ian Buck and others designed CUDA C, they were driven by a north star: can it be used to write a single-precision vector addition (saxpy) with respectable performance, as a clean one-liner that’s easily understood by a C programmer?

One caveat: ignore the name, and don’t try to come up with an interpretation of the attention scores as the probability distribution for a random variable; it’ll make your head hurt and give you bad intuition about Transformers.
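
To make that north star concrete, here is the canonical saxpy kernel in CUDA C. The kernel body is essentially the one line you would write inside the serial C loop; the launch and memory boilerplate around it is our addition for a runnable example, not code from the post.

    #include <cstdio>

    // y = a*x + y over n single-precision floats. Each thread handles one element;
    // the body is the same one-liner as the serial C loop body.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);  // expect 4.0
        cudaFree(x);
        cudaFree(y);
        return 0;
    }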
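
The first of the two tricks, faster approximate exponentials, follows a well-known pattern: rewrite exp(x) as 2^(x * log2(e)), split that exponent into integer and fractional parts, approximate 2^fraction with a short polynomial on ordinary FMA units, and apply the integer part by writing the float's exponent bits directly (CUDA's built-in __expf instead routes through the special function units). The sketch below shows only this generic pattern; the function name, the cubic's coefficients, and the assumed input range are ours, not Flash Attention 4's actual constants or code.

    #include <cmath>
    #include <cstdio>

    // Approximate exp(x) with plain float math instead of the SFU path used by __expf.
    // Assumes x is moderate (roughly -80 < x <= 0, as after subtracting a running max),
    // so the biased exponent below stays in range. Coefficients are an illustrative
    // cubic fit of 2^f on [0, 1), not Flash Attention 4's constants.
    __device__ __forceinline__ float exp_approx(float x) {
        float t  = x * 1.4426950408889634f;   // x * log2(e)
        float fi = floorf(t);                 // integer part
        float f  = t - fi;                    // fractional part in [0, 1)
        float p  = 1.0f + f * (0.6951937f + f * (0.2266265f + f * 0.0781011f));
        int   e  = (int)fi + 127;             // biased exponent of 2^fi
        return p * __int_as_float(e << 23);   // p * 2^fi
    }

    __global__ void apply_exp(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = exp_approx(in[i]);
    }

    int main() {
        const int n = 8;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = -0.5f * i;  // typical post-max-subtraction range
        apply_exp<<<1, n>>>(in, out, n);
        cudaDeviceSynchronize();
        for (int i = 0; i < n; ++i)
            printf("x=% .2f  approx=%.6f  exact=%.6f\n", in[i], out[i], std::exp(in[i]));
        cudaFree(in);
        cudaFree(out);
        return 0;
    }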
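
The second trick builds on the online (streaming) softmax that every Flash Attention generation uses: make one pass over the scores while keeping a running maximum and a running sum of exponentials, rescaling the sum whenever the maximum grows. The host-side sketch below is only that textbook baseline, with toy scores and variable names of our choosing; the post describes FA4's softmax as roughly 10x more efficient, and the sketch makes no attempt to reproduce that refinement.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<float> scores = {1.0f, 3.0f, 2.0f, 5.0f, 0.5f};  // toy attention scores

        float m = -INFINITY;  // running maximum
        float l = 0.0f;       // running sum of exp(score - m)
        for (float s : scores) {
            float m_new = std::fmax(m, s);
            // Rescale the old sum to the new maximum, then fold in the new term.
            l = l * std::exp(m - m_new) + std::exp(s - m_new);
            m = m_new;
        }

        // Normalize once at the end. Flash Attention performs the same update one tile
        // of keys at a time and folds the rescale factor into its output accumulator.
        for (float s : scores)
            std::printf("%.4f ", std::exp(s - m) / l);
        std::printf("\n");
        return 0;
    }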

Related news:

Writing Speed-of-Light Flash Attention for 5090 in CUDA C++

Implement Flash Attention Back End in SGLang – Basics and KV Cache