
FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

About

Large language models (LLMs) encounter computational challenges during long-sequence inference, especially in the attention pre-filling phase, where the complexity grows quadratically with the prompt length. Previous efforts to mitigate these challenges have relied on fixed sparse attention patterns or on sparse patterns identified from a limited set of cases. However, these approaches lack the flexibility to adapt efficiently to varying input demands. In this paper, we introduce FlexPrefill, a Flexible sparse Pre-filling mechanism that dynamically adjusts sparse attention patterns and the computational budget in real time to meet the specific requirements of each input and attention head. The flexibility of our method rests on two key innovations: 1) Query-Aware Sparse Pattern Determination: by measuring Jensen-Shannon divergence, this component adaptively switches between query-specific diverse attention patterns and predefined attention patterns. 2) Cumulative-Attention Based Index Selection: this component dynamically selects the query-key indexes to be computed under each attention pattern, ensuring that the sum of attention scores meets a predefined threshold. FlexPrefill adaptively optimizes the sparse pattern and sparsity ratio of each attention head based on the prompt, enhancing efficiency in long-sequence inference tasks. Experimental results show significant improvements in both speed and accuracy over prior methods, providing a more flexible and efficient solution for LLM inference.
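To make the first innovation concrete, the pattern-determination step can be illustrated with a minimal sketch. The Jensen-Shannon divergence between an estimated attention distribution (e.g., from a representative subset of queries) and a reference structured pattern decides whether a predefined pattern suffices. The function names, the threshold value, and the two pattern labels below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def choose_pattern(est_attn, ref_attn, tau=0.1):
    """Hypothetical per-head switch: if the estimated attention
    distribution diverges from the reference structured pattern by
    more than tau, fall back to a query-specific sparse pattern."""
    if js_divergence(est_attn, ref_attn) > tau:
        return "query_specific"
    return "predefined"
```

In this sketch, a head whose estimated distribution matches the reference keeps the cheap predefined pattern, while a divergent head gets its own query-specific index set; the actual per-head estimation and threshold are specified in the paper.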
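The second innovation, cumulative-attention based index selection, can likewise be sketched: given a query's attention scores over candidate keys, keep the smallest set of indexes whose softmax mass reaches the threshold. This is a simplified single-query illustration under assumed names (`select_indices`, `gamma`), not the paper's blockwise implementation:

```python
import numpy as np

def select_indices(scores, gamma=0.95):
    """Return the smallest set of key indexes whose cumulative
    softmax attention mass reaches the threshold gamma."""
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]        # keys by descending attention
    csum = np.cumsum(probs[order])
    k = int(np.searchsorted(csum, gamma)) + 1  # first prefix reaching gamma
    return np.sort(order[:k])              # selected indexes, ascending
```

Because `gamma` bounds the retained attention mass rather than fixing a sparsity ratio, heads with peaked attention keep very few indexes while heads with flat attention keep many, which is how the method adapts the sparsity ratio per head and per prompt.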

Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, Xun Zhou • 2025

Related benchmarks

Task                                        Dataset                       Result                            Rank
Mathematical Reasoning                      GSM8K                         Accuracy: 84.15                   983
Code Generation                             HumanEval                     Pass@1: 81.1                      850
Long-context Language Understanding         LongBench                     M-Avg: 47.48                      219
Video Understanding                         VideoMME                      Overall Score: 70.34              192
Long-context Understanding                  LongBench                     Overall Average Score: 36.13      115
Video Understanding                         Video-MME without subtitles   Overall Score: 65                 67
Long-context Understanding                  RULER                         Performance @ 4K Context: 97.33   65
Long-context Language Modeling Evaluation   HELMET                        Average Sparsity: 37.88           28
Long-context Understanding                  LongBench                     Overall Average Score: 25.7       17
Long-context Retrieval                      RULER                         Retrieval Accuracy (8K): 93.67    17

(Showing 10 of 22 rows)
