We reverse-engineered Flash Attention 4
Asynchrony, fast approximate exponents, and a 10x more efficient softmax.
We’ve recently been contributing to open source LLM inference engines, so we read the code and reverse-engineered how the kernel works, including two math tricks (faster approximate exponentials and a more efficient online softmax) that are classic Dao. A sketch of the classic online-softmax recurrence that both tricks build on appears below.

When Ian Buck and others designed CUDA C, they were driven by a north star: can it be used to write a single-precision vector addition (saxpy) with respectable performance, as a clean one-liner that’s easily understood by a C programmer? The canonical saxpy kernel is sketched below as well.

Ignore the name and don’t try to come up with an interpretation of the attention scores as the probability distribution for a random variable; it’ll make your head hurt and give you bad intuition about Transformers.
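To make the softmax discussion concrete, here is a minimal sketch of the classic online-softmax recurrence: a running maximum plus a running sum that gets rescaled whenever the maximum changes. This is the textbook version, not FA4’s exact variant, and the names (`OnlineSoftmax`, `update`, `weight`) are illustrative rather than taken from the kernel. The `exp2f`-based exponential is included only because the GPU’s hardware exponential is base-2, which is also why base-2 tricks show up in fast attention kernels.

```cuda
#include <math.h>

// Illustrative online-softmax accumulator (names are hypothetical, not FA4's):
// keep a running max m and a running sum l of exp(score - m); whenever a new
// score raises the max, rescale the old sum into the new reference frame.
struct OnlineSoftmax {
    float m;  // running maximum of the scores seen so far
    float l;  // running sum of exp(score - m)

    __device__ void init() {
        m = -INFINITY;
        l = 0.0f;
    }

    __device__ void update(float score) {
        const float LOG2E = 1.4426950408889634f;  // log2(e)
        float m_new = fmaxf(m, score);
        // exp(x) computed as exp2(x * log2(e)): the hardware exponential
        // unit on NVIDIA GPUs works in base 2.
        l = l * exp2f((m - m_new) * LOG2E) + exp2f((score - m_new) * LOG2E);
        m = m_new;
    }

    // Normalized weight for one score, once every score has been folded in.
    __device__ float weight(float score) const {
        const float LOG2E = 1.4426950408889634f;
        return exp2f((score - m) * LOG2E) / l;
    }
};
```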
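For reference, this is the canonical CUDA C saxpy that the design-goal sentence alludes to: the kernel body is effectively one line a C programmer can read at a glance. The launch configuration in the comment is just one common choice, not anything specific to this article.

```cuda
// Single-precision y = a*x + y, one thread per element.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Example launch: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```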