Sparser Block-Sparse Attention via Token Permutation

About

Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\times$ in long-context prefilling, confirming its practical viability. Code available at https://github.com/xinghaow99/pbs-attn
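
The "permutation properties of attention" mentioned in the abstract can be checked on a toy example. The sketch below is not the PBS-Attn implementation or its kernels; it is a minimal, standard-library-only illustration, assuming plain non-causal scaled dot-product attention. It shows that reordering keys and values with the same permutation leaves the output unchanged, and that reordering queries only reorders the output rows. All names (attention, rand_matrix, perm, q_perm) are hypothetical, and causal masking is deliberately left out of scope.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Non-causal scaled dot-product attention on lists of vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out


def rand_matrix(rng, n, d):
    """n random d-dimensional vectors (illustrative data only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


def max_abs_diff(A, B):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


rng = random.Random(0)
N, d = 8, 4
Q, K, V = (rand_matrix(rng, N, d) for _ in range(3))
base = attention(Q, K, V)

# 1) Apply the SAME permutation to keys and values: the output is unchanged,
#    because each query still sees the same set of (key, value) pairs.
perm = list(range(N))
rng.shuffle(perm)
kv_permuted = attention(Q, [K[i] for i in perm], [V[i] for i in perm])
kv_diff = max_abs_diff(base, kv_permuted)

# 2) Permute the queries: the output rows are permuted the same way,
#    so undoing the permutation recovers the original output.
q_perm = list(range(N))
rng.shuffle(q_perm)
out_q = attention([Q[i] for i in q_perm], K, V)
restored = [None] * N
for new_pos, old_pos in enumerate(q_perm):
    restored[old_pos] = out_q[new_pos]
q_diff = max_abs_diff(base, restored)

print(f"key/value permutation diff: {kv_diff:.2e}")
print(f"query permutation diff after un-permuting: {q_diff:.2e}")
```

Because the result does not depend on the order of key/value pairs, tokens can in principle be regrouped so that the keys that matter for a query block sit in fewer blocks, which is the kind of block-level sparsity the abstract describes. How PBS-Attn chooses its permutation and handles causal prefilling is not specified here and is not modeled by this sketch.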

Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu • 2025

Related benchmarks

Task                                 Dataset        Metric                  Result  Rank
Long-context Understanding           LongBench v2   Overall Score           34.39   133
Long-context language modeling       RULER          Accuracy (8K Context)   93.85   75
Long-context Language Understanding  LongBench v2   Overall Accuracy        32      62
Long-context language evaluation     LongBench      SQA                     47.04   19
Long-context Language Understanding  LongBench      Single-Doc QA Accuracy  48      12
Long-context Language Understanding  LongBench      Average Score           41.95   12
Long-context language modeling       InfiniteBench  En. Sum Accuracy        18      10