
Writing Speed-of-Light Flash Attention for 5090 in CUDA C++


In this post, I will walk through how I learned to implement Flash Attention for the 5090 in CUDA C++. The main objective is to learn to write attention in CUDA C++, since many features are not available in Triton, such as MXFP8 / NVFP4 MMA for sm120. I also feel this is a natural next step after learning about matmul kernels. Lastly, while there are many excellent blogposts on writing fast matmul kernels, there are none for attention, so I want to take this chance to write something up nicely.
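As a quick refresher (my sketch, not code from this post): Flash Attention computes softmax(QK^T)V one key/value block at a time, keeping a running row max and a running softmax denominator so the full score matrix is never materialized. Below is a minimal single-query CPU reference of that online-softmax rescaling; the names (`q`, `K`, `V`, `N`, `d`) are illustrative, and the 1/sqrt(d) scaling is omitted for brevity.

```cpp
#include <cmath>
#include <vector>

// Reference sketch of the online softmax used by Flash Attention,
// for a single query row. q: [d], K/V: [N][d] row-major, out: [d].
// Each iteration plays the role of one K/V tile (tile size 1 for clarity).
void attention_one_row(const float* q, const float* K, const float* V,
                       float* out, int N, int d) {
    float m = -INFINITY;              // running row max
    float l = 0.0f;                   // running softmax denominator
    std::vector<float> acc(d, 0.0f);  // running (unnormalized) output

    for (int j = 0; j < N; ++j) {
        float s = 0.0f;               // s = q . k_j
        for (int t = 0; t < d; ++t) s += q[t] * K[j * d + t];

        float m_new = std::fmax(m, s);
        float scale = std::exp(m - m_new);  // rescale previous partial results
        float p = std::exp(s - m_new);      // current (unnormalized) probability

        l = l * scale + p;
        for (int t = 0; t < d; ++t)
            acc[t] = acc[t] * scale + p * V[j * d + t];
        m = m_new;
    }
    for (int t = 0; t < d; ++t) out[t] = acc[t] / l;  // final normalization
}
```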

Or you can check out the GPU-MODE series (slides, YouTube) for basic CUDA C++ knowledge, as well as the excellent matmul blogposts mentioned above, to quickly get up to speed. This NVIDIA blogpost gives a pretty good explanation of the idea, but generally I don't really like using the CUDA C++ API (and considering that CUTLASS also doesn't, I think it's more fun to use PTX directly).

| Kernel                       | TFLOPS | % of SOL |
|------------------------------|--------|----------|
| v2 (shared memory swizzling) | 181.11 | 86.45%   |
| v3 (2-stage pipelining)      | 189.84 | 90.62%   |

For the last two versions, I couldn't identify any optimization opportunities from the profiling data (maybe just skill issue).
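To give a flavor of what "using PTX directly" means (again a sketch under my own assumptions, not the post's kernel code): one Tensor Core `mma.sync.aligned.m16n8k16` instruction with f16 inputs and f32 accumulation can be issued via inline asm as below. The fragment registers are assumed to have already been loaded (e.g. via `ldmatrix`) in the layout the PTX ISA prescribes.

```cpp
#include <cuda_fp16.h>

// Minimal inline-PTX wrapper for one m16n8k16 Tensor Core MMA
// (f16 inputs, f32 accumulate). Per the PTX ISA, the per-thread
// fragments are: A = 4 x .b32 (8 halves), B = 2 x .b32 (4 halves),
// C/D = 4 x .f32. Illustrative only; the post's kernel may differ.
__device__ void mma_m16n8k16_f16f32(float d[4],
                                    const unsigned a[4],
                                    const unsigned b[2],
                                    const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```

The 2-stage pipelining in v3 is in the same spirit: instead of the `cuda::pipeline` API, the global-to-shared copies can be issued with the `cp.async` family of PTX instructions and synchronized with `cp.async.commit_group` / `cp.async.wait_group`, overlapping the next tile's loads with the current tile's MMAs.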
