FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention


In theory, Attention is All You Need. In practice, however, we also need optimized attention implementations like FlashAttention.

This operates as a sort of "software lottery" for ML researchers: if your attention variant doesn't fit into one of the existing optimized kernels, you're doomed to slow runtime and CUDA OOMs. We provide a flexible API that allows implementing many attention variants (including all the ones mentioned in the blog post so far) in a few lines of idiomatic PyTorch code. Although FlexAttention does not need to recompile every time something changes, careless caching (of the block mask, for example) can lead to significant slowdowns (check out the FAQ for suggestions on best practices).
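As a concrete illustration, here is a minimal sketch of what an attention variant looks like through this API, assuming PyTorch 2.5+ with a CUDA GPU; the shapes, the relative-position score_mod, and the causal mask_mod are illustrative choices for this example, not code taken from the post.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Toy shapes for illustration: (batch, heads, seq_len, head_dim).
B, H, S, D = 2, 4, 256, 64
q = torch.randn(B, H, S, D, device="cuda")
k = torch.randn(B, H, S, D, device="cuda")
v = torch.randn(B, H, S, D, device="cuda")

# A score_mod receives the raw attention score plus the batch, head,
# query index, and key/value index, and returns a modified score.
# This one adds a simple relative-position bias.
def relative_bias(score, b, h, q_idx, kv_idx):
    return score + (q_idx - kv_idx)

# A mask_mod returns True where attention is allowed; here, plain causal masking.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# Build the block mask once and reuse it for every call that shares the same
# sequence lengths and sparsity pattern, instead of rebuilding it each step.
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")

# torch.compile fuses the score_mod and mask_mod into a single attention kernel.
flex_attention = torch.compile(flex_attention)

out = flex_attention(q, k, v, score_mod=relative_bias, block_mask=block_mask)  # (B, H, S, D)
```

Building the block mask once and reusing it across forward passes, and compiling flex_attention a single time, is the kind of caching the paragraph above refers to.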
