FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
About
Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes. However, it has yet to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35% utilization on the H100 GPU. We develop three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) block quantization and incoherent processing that leverages hardware support for FP8 low-precision. We demonstrate that our method, FlashAttention-3, achieves speedup on H100 GPUs by 1.5-2.0$\times$ with FP16 reaching up to 740 TFLOPs/s (75% utilization), and with FP8 reaching close to 1.2 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6$\times$ lower numerical error than a baseline FP8 attention.
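The memory-saving idea that the FlashAttention line of work builds on can be illustrated with a small NumPy sketch: process keys and values one block at a time while keeping a running row maximum and row sum, so the full attention matrix is never materialized. This is only an illustration of block-wise softmax under simplifying assumptions (single head, no masking, float64 on CPU). The function names and the `block_size` parameter are hypothetical, and the sketch does not model warp-specialization, TMA, the matmul/softmax interleaving, or FP8 quantization described in the abstract.

```python
import numpy as np


def reference_attention(q, k, v):
    """Standard attention that materializes the full score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def blockwise_attention(q, k, v, block_size=64):
    """Attention computed over key/value blocks with an online softmax.

    Only one (n_q, block_size) tile of scores exists at any time.
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n_q, v.shape[-1]))      # unnormalized output accumulator
    row_max = np.full((n_q, 1), -np.inf)    # running max of scores per query row
    row_sum = np.zeros((n_q, 1))            # running softmax denominator

    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]

        s = (q @ k_blk.T) * scale
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))

        # Rescale earlier partial results to the new running max.
        # On the first block this factor is exp(-inf) = 0.
        correction = np.exp(row_max - new_max)
        p = np.exp(s - new_max)

        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum


rng = np.random.default_rng(0)
q = rng.standard_normal((128, 32))
k = rng.standard_normal((200, 32))
v = rng.standard_normal((200, 16))
print(np.allclose(blockwise_attention(q, k, v), reference_attention(q, k, v)))
```

The block-wise result matches the reference up to floating-point rounding, including when the number of keys is not a multiple of `block_size`.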
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 24 | Avg@32 Accuracy | 93.85 | 23 |
| Alignment | IFEval strict prompt | pass@1 | 86.9 | 16 |
| Video Generation | LongVGenBench LongVie2 (test) | LongVGenBench Score | 69.67 | 15 |
| Rolling-Forcing | LongVBench | VBench Score | 84.08 | 15 |
| LLM Inference | Long-Context LLM Inference Decode | Latency (ms) | 0.7 | 8 |
| General QA | MMLU-Redux | Exact Match | 90.48 | 7 |
| LLM Inference | Long-Context LLM Inference (Prefill) | Prefill Latency (ms) | 0.76 | 6 |
| General Reasoning | GPQA Diamond | Mean@16 | 84.15 | 4 |
| General Reasoning | ZebraLogic | Mean@1 | 96.1 | 4 |
| Mathematical Reasoning | MATH 500 | Mean@1 | 0.992 | 4 |