FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

About

Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention introduced an approach to speed up attention on GPUs by minimizing memory reads and writes. However, it has yet to take advantage of new capabilities in recent hardware: FlashAttention-2 achieves only 35% utilization on the H100 GPU. We develop three main techniques to speed up attention on Hopper GPUs: exploiting the asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp specialization and (2) interleave block-wise matmul and softmax operations, and (3) applying block quantization and incoherent processing that leverage hardware support for FP8 low precision. We demonstrate that our method, FlashAttention-3, achieves a 1.5-2.0$\times$ speedup on H100 GPUs, with FP16 reaching up to 740 TFLOPs/s (75% utilization) and FP8 reaching close to 1.2 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6$\times$ lower numerical error than a baseline FP8 attention.
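The block-wise matmul and softmax structure mentioned in the abstract can be illustrated with a small sketch. The code below is not the FlashAttention-3 kernel and models none of the GPU asynchrony, warp specialization, or FP8 handling; it is a plain-Python illustration, for a single query vector, of how attention can be computed one key/value block at a time with a running softmax so that the full score vector is never materialized. The function names `reference_attention` and `blockwise_attention` and the `block_size` parameter are hypothetical and chosen only for this example.

```python
import math

def reference_attention(q, keys, values):
    """softmax(q . K^T / sqrt(d)) V for one query, computing all scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def blockwise_attention(q, keys, values, block_size=4):
    """Same result, visiting K/V one block at a time with a running softmax."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized output, relative to running_max
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        # First matmul: scores of the query against this key block.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        # Second matmul: accumulate the weighted values of this block.
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Small deterministic example: 10 keys/values of dimension 2 and 3.
q = [1.0, 0.5]
keys = [[0.1 * i, -0.2 * i] for i in range(10)]
values = [[float(i), 1.0, 0.5 * i] for i in range(10)]
full = reference_attention(q, keys, values)
tiled = blockwise_attention(q, keys, values, block_size=4)
```

Both functions should agree up to floating-point rounding. In each loop iteration the two matmul-like steps and the exponentials of the softmax are separate pieces of work, which is the structure the abstract describes interleaving on Hopper hardware.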

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao • 2024

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH 500	Mean@1 0.992	55
Mathematical Reasoning	AIME 25	Mean@32 89.58	30
Alignment	IFEval strict prompt	pass@1 86.9	26
Mathematical Reasoning	AIME 24	Avg@32 Accuracy 93.85	23
Video Generation	LongVGenBench LongVie2 (test)	LongVGenBench Score 69.67	15
Rolling-Forcing	LongVBench	VBench Score 84.08	15
Text-to-Video Generation	VBench official evaluation prompts	Semantic Score 72.71	15
Text-to-Video Generation	VBench	VBench Semantic Score 76.06	10
LLM Inference	Long-Context LLM Inference Decode	Latency (ms) 0.7	8
General QA	MMLU-Redux	Exact Match 90.48	7

Showing 10 of 19 rows
