FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention
In theory, Attention is All You Need. In practice, however, we also need optimized attention implementations like FlashAttention.
This operates as a sort of “software lottery” for ML researchers: if your attention variant doesn’t fit into one of the existing optimized kernels, you’re doomed to slow runtime and CUDA OOMs. That’s why we’re introducing FlexAttention, a new PyTorch API. We provide a flexible API that allows implementing many attention variants (including all the ones mentioned in the blog post so far) in a few lines of idiomatic PyTorch code. Although FlexAttention will not need to recompile every time your attention variant’s inputs change, if you aren’t careful about caching, you can still see significant slowdowns (check out the FAQ for suggestions on best practices).
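As a concrete illustration, here is a minimal sketch of writing an attention variant with FlexAttention, assuming the flex_attention and create_block_mask entry points in torch.nn.attention.flex_attention (available in recent PyTorch releases, roughly 2.5+); the relative_positional and causal helpers below are illustrative names, not part of the API:

```python
# Minimal FlexAttention sketch: causal masking plus a relative positional bias.
# Assumes a CUDA device and the torch.nn.attention.flex_attention module.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

# score_mod: a function applied to each attention score before the softmax.
def relative_positional(score, b, h, q_idx, kv_idx):
    return score + (q_idx - kv_idx)

# mask_mod: returns True for (q_idx, kv_idx) pairs that are allowed to attend.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# Build the block mask once and reuse it across iterations; recreating it
# on every call is exactly the kind of caching pitfall mentioned above.
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S)

# torch.compile fuses the score_mod and block mask into a single fused kernel.
compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, score_mod=relative_positional, block_mask=block_mask)
```

Swapping in a different variant is just a matter of changing the score_mod or mask_mod function; the rest of the call stays the same.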