FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision


Attention, as a core layer of the ubiquitous Transformer architecture, is a bottleneck for large language models and long-context applications. FlashAttention (and FlashAttention-2) pioneered an approach to speed up attention on GPUs by minimizing memory reads/writes, and is now used by most libraries to accelerate Transformer training and inference. FlashAttention is an algorithm that reorders the attention computation and leverages tiling and recomputation to significantly speed it up and reduce memory usage from quadratic to linear in sequence length.
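To make the tiling-and-recomputation idea concrete, here is a minimal PyTorch sketch of the blockwise, online-softmax computation that FlashAttention builds on. It is not the actual CUDA kernel; the block size, tensor shapes, and the pure-PyTorch loop are illustrative assumptions.

```python
# Minimal sketch of tiled attention with online softmax: keys/values are processed in
# blocks, and running max/sum statistics rescale the partial output so the full
# N x N score matrix is never materialized (memory linear in sequence length).
import torch

def tiled_attention(q, k, v, block_size=128):
    """q, k, v: (seq_len, head_dim). Returns softmax(q @ k.T / sqrt(d)) @ v, block by block."""
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seq_len, 1), float("-inf"))  # running max of scores per query
    row_sum = torch.zeros(seq_len, 1)                   # running softmax denominator

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale                  # scores for one key block only

        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)       # rescale previous partial results
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum

# Sanity check against the naive quadratic-memory implementation.
q, k, v = (torch.randn(1024, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)
```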

In this blogpost, we describe three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) incoherent processing that leverages hardware support for FP8 low-precision. In our experiment where Q, K, V are generated from a standard normal distribution but 0.1% of the entries have large magnitudes (to simulate outliers), we found that incoherent processing can reduce the quantization error by 2.6x.
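As a rough illustration of why incoherent processing helps at low precision, the sketch below rotates Q and K by a random orthogonal matrix (a stand-in for the fast Hadamard transform) before a crude uniform quantizer (a stand-in for FP8). Since M @ M.T = I, the product Q @ K.T is unchanged, but outliers get spread across dimensions, so the quantization error drops. The quantizer, outlier injection, and seed are assumptions for illustration; the 2.6x figure comes from the authors' FP8 experiment, not from this toy script.

```python
# Toy demo of incoherent processing: rotating by a random orthogonal matrix preserves
# Q @ K.T exactly but spreads outlier entries, which shrinks the quantization step size.
import torch

def fake_quant(x, n_levels=256):
    """Crude symmetric uniform quantizer used as a stand-in for FP8."""
    scale = x.abs().max() / (n_levels / 2 - 1)
    return torch.round(x / scale) * scale

torch.manual_seed(0)
d = 64
q, k = torch.randn(1024, d), torch.randn(1024, d)
# Inject rare large-magnitude entries to simulate outliers, as in the post's experiment.
q[torch.rand_like(q) < 0.001] *= 30
k[torch.rand_like(k) < 0.001] *= 30

exact = q @ k.T

# Random orthogonal matrix via QR decomposition (Hadamard transforms are used in practice
# because they can be applied in O(d log d)).
m, _ = torch.linalg.qr(torch.randn(d, d))

err_plain = (fake_quant(q) @ fake_quant(k).T - exact).abs().mean()
err_rot = (fake_quant(q @ m) @ fake_quant(k @ m).T - exact).abs().mean()
print(f"mean abs error without rotation: {err_plain:.4f}")
print(f"mean abs error with rotation:    {err_rot:.4f}")
```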
