Attn-QAT: 4-Bit Attention With Quantization-Aware Training
About
Achieving reliable 4-bit attention is a prerequisite for end-to-end FP4 computation on emerging FP4-capable GPUs, yet attention remains the main obstacle due to FP4's tiny dynamic range and attention's heavy-tailed activations. This paper presents the first systematic study of 4-bit quantization-aware training (QAT) for attention. We find that "drop-in" QAT, which naively combines an FP4 forward pass with a high-precision Flash Attention (FA)-style backward pass, leads to training instability. We identify two key principles for stable FP4 attention: (1) the backward pass must recompute attention scores in matching low precision, and (2) implicit precision assumptions in FA's gradient calculation must be resolved. Based on these insights, we propose Attn-QAT and implement fused Triton kernels for training as well as FP4 inference kernels. Across diffusion and language models, Attn-QAT recovers the quality lost to FP4 attention without the explicit outlier-mitigation heuristics used in prior FP4 attention methods, and delivers up to a 1.5x speedup on an RTX 5090. Video demos can be found at https://drive.google.com/drive/folders/190F6xbBDUF2kGQYIcXBt3ehSYij5jlim?usp=sharing.
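To make principle (1) concrete, below is a minimal PyTorch sketch of fake-quantized attention whose backward pass recomputes the scores from the same quantized tensors the forward used, rather than re-deriving them from full-precision inputs. The E2M1 level grid, per-tensor absmax scaling, straight-through estimator, and the names `fake_quant_fp4` and `QATAttention` are illustrative assumptions for this sketch; the paper's actual Attn-QAT uses fused Triton kernels and its exact quantization scheme is not reproduced here.

```python
import math

import torch

# Representable non-negative magnitudes of the E2M1 FP4 format. Treating
# quantization as per-tensor absmax scaling onto this grid is an assumption
# made for illustration, not the paper's scheme.
_FP4_LEVELS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4(x: torch.Tensor) -> torch.Tensor:
    """Fake-quantize x onto an E2M1-style FP4 grid (per-tensor absmax scale)."""
    levels = _FP4_LEVELS.to(device=x.device, dtype=x.dtype)
    scale = x.abs().amax().clamp(min=1e-12) / levels[-1]
    mags = (x / scale).abs()
    # Nearest representable level per element (dense comparison: fine for a
    # sketch, too memory-hungry for a production kernel).
    idx = (mags.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return torch.sign(x) * levels[idx] * scale


class QATAttention(torch.autograd.Function):
    """Fake-quantized attention whose backward recomputes scores from the same
    quantized tensors the forward used, mirroring principle (1) above.
    No causal mask or dropout, for brevity."""

    @staticmethod
    def forward(ctx, q, k, v):
        qq, kq, vq = fake_quant_fp4(q), fake_quant_fp4(k), fake_quant_fp4(v)
        s = qq @ kq.transpose(-1, -2) / math.sqrt(q.shape[-1])
        p = torch.softmax(s, dim=-1)
        # Save the *quantized* inputs: the backward rebuilds exactly the
        # scores the forward saw, instead of recomputing them from the
        # full-precision q/k and silently assuming high precision.
        ctx.save_for_backward(qq, kq, vq)
        return p @ vq

    @staticmethod
    def backward(ctx, do):
        qq, kq, vq = ctx.saved_tensors
        d = qq.shape[-1]
        # Matched low-precision recomputation of the attention scores.
        s = qq @ kq.transpose(-1, -2) / math.sqrt(d)
        p = torch.softmax(s, dim=-1)
        dv = p.transpose(-1, -2) @ do
        dp = do @ vq.transpose(-1, -2)
        # Standard softmax backward: dS = P * (dP - rowsum(dP * P)).
        ds = p * (dp - (dp * p).sum(dim=-1, keepdim=True))
        dq = ds @ kq / math.sqrt(d)
        dk = ds.transpose(-1, -2) @ qq / math.sqrt(d)
        # Straight-through estimator: gradients pass through the quantizers.
        return dq, dk, dv


# Quick smoke test on random (batch, heads, seq, dim) tensors.
q, k, v = (torch.randn(2, 4, 128, 64, requires_grad=True) for _ in range(3))
out = QATAttention.apply(q, k, v)
out.sum().backward()
print(out.shape, q.grad.shape)
```

By contrast, a "drop-in" QAT baseline would save the full-precision q and k and recompute softmax(qkᵀ/√d) in high precision during the backward pass, creating exactly the forward/backward mismatch the abstract identifies as a source of training instability.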
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 92.95 | 1362 |
| Commonsense Reasoning | WinoGrande | Accuracy | 79.4 | 1085 |
| Language Understanding | MMLU | Accuracy | 80.44 | 825 |
| Language Modeling | WikiText | PPL | 0.3076 | 732 |
| Instruction Following | IFEval | IFEval Accuracy | 86.37 | 625 |
| Science Question Answering | ARC-C | Accuracy | 61.53 | 193 |
| Commonsense Inference | HellaSwag | Accuracy | 85.57 | 91 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 83.51 | 78 |
| Graduate-Level Reasoning | GPQA Diamond | -- | -- | 24 |