IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference

About

Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax-related path as the dominant bottleneck. This stage incurs a costly dequantize -> softmax -> requantize detour, which can account for up to 65% of total attention latency and disrupts the end-to-end integer dataflow critical for edge hardware efficiency. To address this limitation, we present IntAttention, the first fully integer attention pipeline that serves as a training-free drop-in replacement. At the core of our approach lies IndexSoftmax, a hardware-friendly operator that replaces floating-point exponentials entirely within the integer domain. IntAttention integrates sparsity-aware clipping, a 32-entry lookup table approximation, and direct integer normalization, thereby eliminating datatype conversion overhead along the attention path. Experiments on Armv8 CPUs show that our method achieves up to 3.7x speedup and 61% energy reduction over FP16 baselines, and up to 2.0x speedup over conventional INT8 attention pipelines. Across diverse language and vision models, as well as additional reasoning and long-context evaluations, IntAttention maintains strong overall fidelity and demonstrates a more favorable trade-off than existing LUT-based softmax approximations. Code is available at https://github.com/WanliZhong/IntAttention
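
The abstract names three ingredients of IndexSoftmax: clipping, a 32-entry lookup table, and direct integer normalization. The Python sketch below shows one way such an integer-only softmax could be put together. It is an illustration, not the authors' implementation: the clipping range `CLIP`, the 8-bit output range `OUT_MAX`, the index rounding rule, and every function and variable name are assumptions made for this example. Only the table size of 32 comes from the abstract.

```python
import math

LUT_SIZE = 32   # table size stated in the abstract
CLIP = 8.0      # ASSUMED clipping range in real (dequantized) units
OUT_MAX = 255   # ASSUMED unsigned 8-bit output range

# Built once offline. Floating point is used only here, never at inference time.
# Entry i approximates OUT_MAX * exp(-CLIP * i / (LUT_SIZE - 1)).
EXP_LUT = [round(OUT_MAX * math.exp(-CLIP * i / (LUT_SIZE - 1)))
           for i in range(LUT_SIZE)]


def index_softmax_row(logits_q, clip_q):
    """Integer-only softmax approximation over one row of integer logits.

    logits_q: list of ints, e.g. INT32 accumulators from an INT8 Q*K^T product.
    clip_q:   CLIP expressed in the same integer units, i.e. round(CLIP / scale);
              must be a positive integer.
    """
    row_max = max(logits_q)
    weights = []
    for x in logits_q:
        # Distance below the row maximum, clipped: far-away logits all land
        # on the last table entry, which is (near) zero.
        d = min(row_max - x, clip_q)
        # Map the clipped distance to a table index in [0, LUT_SIZE - 1],
        # rounding to nearest with integer arithmetic only.
        idx = (d * (LUT_SIZE - 1) + clip_q // 2) // clip_q
        weights.append(EXP_LUT[idx])
    # The maximum element always maps to index 0, so total >= OUT_MAX > 0.
    total = sum(weights)
    # Normalize directly to integers with a rounded integer division.
    return [(w * OUT_MAX + total // 2) // total for w in weights]


if __name__ == "__main__":
    print(index_softmax_row([10, 10], clip_q=100))
    print(index_softmax_row([0, 50, 100], clip_q=100))
```

In this sketch the outputs are 8-bit style probabilities that can feed an integer probability-times-V product without a dequantize or requantize step. Because each output is rounded independently, a row may not sum to exactly `OUT_MAX`.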

Wanli Zhong, Haibo Feng, Zirui Zhou, Hanyang Peng, Shiqi Yu • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Image Classification	ImageNet-1K 1.0 (val)	Top-1 Accuracy	86.1	2386
Commonsense Reasoning	WinoGrande	Accuracy	61.01	1581
Commonsense Reasoning	PIQA	Accuracy	74.92	400
Mathematical Reasoning	GSM8K	Accuracy (Acc)	35.03	352
Language Modeling	WikiText	Word Perplexity	13.07	331
Language Modeling	LAMBADA	Accuracy	63.61	114
Instruction Following	IFEval	Accuracy (IFEval)	39.74	101
Code Generation	MBPP	--	--	87
End-to-end attention latency measurement	RK3588S2	Attention Latency (ms)	2.95	20
End-to-end attention latency measurement	Apple M2	Latency (ms)	0.87	20
