Accelerating Prefilling via Decoding-time Contribution Sparsity

About

Large Language Models (LLMs) incur quadratic attention complexity with input length, creating a major time bottleneck in the prefilling stage. Existing acceleration methods largely exploit attention score sparsity by estimating blocks with high attention scores and applying dynamic sparse attention. In this work, we identify another untapped form of sparsity in the prefilling stage, namely decoding-time contribution sparsity, where many attention blocks exhibit nontrivial attention scores during prefilling yet contribute negligibly to subsequent decoding, as indicated by gradient-based analysis. Building on this observation, we propose TriangleMix, a training-free static attention pattern that uses dense attention in a subset of layers and switches to Triangle attention in the others. Extensive experiments show that TriangleMix preserves nearly lossless performance relative to dense attention while substantially reducing attention overhead in Triangle layers. For 128K inputs, Triangle attention achieves a 15.3x speedup in attention computation, significantly exceeding the acceleration of typical dynamic sparse methods (1.9x to 3.4x). Furthermore, TriangleMix can be seamlessly combined with dynamic sparsity approaches, delivering an additional 6% to 19% reduction in TTFT over using dynamic sparsity alone. Our code is released at https://aka.ms/TriangleMix.

Zhiyuan He, Yike Zhang, Chengruidong Zhang, Huiqiang Jiang, Yuqing Yang, Lili Qiu • 2025
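The abstract describes TriangleMix as a static pattern that keeps dense attention in some layers and uses Triangle attention in the rest, but it does not spell out the Triangle mask itself. The sketch below is a minimal illustration, not the authors' implementation. It assumes the Triangle pattern keeps a few initial "sink" keys, a sliding window of recent keys, and full causal rows for the last few queries, and that the dense layers are the first ones. The function names and the parameters `sink`, `window`, `last_q`, and `num_dense_layers` are hypothetical and chosen only for this example.

```python
def triangle_mask(n, sink=4, window=8, last_q=8):
    """Return an n x n boolean mask; True means query i may attend to key j.

    Assumed pattern (hypothetical parameters, not taken from the abstract):
    causal attention restricted to the first `sink` keys, a sliding `window`
    of recent keys, and full causal rows for the final `last_q` queries.
    """
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only keys up to and including i
            is_sink = j < sink
            is_local = i - j < window
            is_last_query = i >= n - last_q
            mask[i][j] = is_sink or is_local or is_last_query
    return mask


def layer_patterns(num_layers, num_dense_layers):
    """Static per-layer schedule: dense first, Triangle afterwards (assumed order)."""
    return [
        "dense" if layer < num_dense_layers else "triangle"
        for layer in range(num_layers)
    ]


def kept_fraction(mask):
    """Fraction of causal (query, key) pairs that the mask keeps."""
    n = len(mask)
    kept = sum(sum(row) for row in mask)
    return kept / (n * (n + 1) // 2)


if __name__ == "__main__":
    mask = triangle_mask(64)
    print(layer_patterns(num_layers=4, num_dense_layers=1))
    print(f"kept fraction of causal pairs: {kept_fraction(mask):.3f}")
```

Because the schedule and the mask are fixed ahead of time, nothing has to be estimated per input, which is the sense in which the pattern is static rather than dynamic. In this sketch the kept fraction shrinks as `n` grows with the other parameters held fixed, since each middle query row keeps at most `sink + window` keys.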

Related benchmarks

Task	Dataset	Metric	Result
Long-context Understanding	LongBench v2	Overall Score	30.53	185
Long-context evaluation	RULER	Average Accuracy Score	90.2	59
Mathematical Reasoning	GSM-Infinite (8K)	Accuracy	22.9	24
Mathematical Reasoning	GSM-Infinite 16K	Accuracy	16.2	24
Long-context multi-task evaluation	LongBench-e	Qasper	45.1	24
Long-context Understanding	RULER	Performance (8K Context)	92.44	24
Long-context Understanding	RULER (test)	Accuracy (4K Context)	96.3	24
Mathematical Reasoning	Math GSM8K AIME24	Accuracy (GSM8K)	46.3	24
Mathematical Reasoning	GSM-Infinite (Avg)	Accuracy	16.9	24
Mathematical Reasoning	GSM-Infinite 32K	Accuracy	14.1	24

