
IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference

About

Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax-related path as the dominant bottleneck. This stage incurs a costly dequantize -> softmax -> requantize detour, which can account for up to 65% of total attention latency and disrupts the end-to-end integer dataflow critical for edge hardware efficiency. To address this limitation, we present IntAttention, the first fully integer attention pipeline that serves as a training-free drop-in replacement. At the core of our approach lies IndexSoftmax, a hardware-friendly operator that replaces floating-point exponentials entirely within the integer domain. IntAttention integrates sparsity-aware clipping, a 32-entry lookup table approximation, and direct integer normalization, thereby eliminating datatype conversion overhead along the attention path. Experiments on Armv8 CPUs show that our method achieves up to 3.7x speedup and 61% energy reduction over FP16 baselines, and up to 2.0x speedup over conventional INT8 attention pipelines. Across diverse language and vision models, as well as additional reasoning and long-context evaluations, IntAttention maintains strong overall fidelity and demonstrates a more favorable trade-off than existing LUT-based softmax approximations. Code is available at https://github.com/WanliZhong/IntAttention
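The abstract names three ingredients for IndexSoftmax: clipping, a 32-entry lookup table, and integer normalization. The Python sketch below illustrates how such pieces could fit together; it is not the authors' implementation. The 8-bit table and output scales, the uniform spacing of table entries, the rounding rule, and the names build_exp_lut, int_softmax_row, and clip_int are all assumptions made for illustration. The table is built offline with floating point; the per-row path uses only integer operations.

```python
import math

LUT_SIZE = 32      # table size stated in the abstract
LUT_SCALE = 255    # assumption: table entries stored as 8-bit values
OUT_SCALE = 255    # assumption: probabilities emitted as unsigned 8-bit

def build_exp_lut(clip):
    """Offline step (floats allowed): sample exp(-t) at 32 points in [0, clip)."""
    step = clip / LUT_SIZE
    return [round(LUT_SCALE * math.exp(-i * step)) for i in range(LUT_SIZE)]

def int_softmax_row(logits, clip_int, lut):
    """Integer-only softmax approximation over one row of integer logits.

    clip_int is the clipping distance in integer logit units, i.e. the
    real-valued clip divided by the logit scale, rounded offline.
    """
    row_max = max(logits)
    weights = []
    for x in logits:
        d = row_max - x                        # non-negative integer distance
        if d >= clip_int:
            weights.append(0)                  # clipped: treated as exactly zero
        else:
            idx = (d * LUT_SIZE) // clip_int   # integer index, always < LUT_SIZE
            weights.append(lut[idx])
    total = sum(weights)                       # > 0: the row maximum maps to lut[0]
    # Rounded integer division replaces the floating-point normalization.
    return [(w * OUT_SCALE + total // 2) // total for w in weights]

# Example: real-valued clip of 8.0 with a hypothetical logit scale of 0.125.
lut = build_exp_lut(8.0)
probs = int_softmax_row([64, 32, 0, -100], clip_int=64, lut=lut)
```

Logits at least clip_int below the row maximum contribute exactly zero, so the table only has to cover a bounded range. The outputs are integers on a 0 to 255 scale, so they may not sum to exactly 255 after rounding.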

Wanli Zhong, Haibo Feng, Zirui Zhou, Hanyang Peng, Shiqi Yu • 2025

Related benchmarks

Task | Dataset | Result | Rank
Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy: 86.1 | 2238
Commonsense Reasoning | WinoGrande | Accuracy: 61.01 | 1442
Mathematical Reasoning | GSM8K | Accuracy (Acc): 35.03 | 337
Language Modeling | WikiText | Word Perplexity: 13.07 | 234
Commonsense Reasoning | PIQA | Accuracy: 74.92 | 213
Language Modeling | LAMBADA | Accuracy: 63.61 | 103
Instruction Following | IFEval | Accuracy (IFEval): 39.74 | 89
Code Generation | MBPP | -- | 79
End-to-end attention latency measurement | RK3588S2 | Attention Latency (ms): 2.95 | 20
End-to-end attention latency measurement | Apple M2 | Latency (ms): 0.87 | 20
Showing 10 of 13 rows
