Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models
About
Attention serves as the fundamental mechanism for long-context modeling in large language models (LLMs), yet dense attention becomes prohibitive for long sequences due to its quadratic complexity. Consequently, sparse attention has attracted growing interest as a scalable alternative. However, existing sparse attention methods rely on coarse-grained semantic representations during block selection, which blur intra-block semantic boundaries and lead to the loss of critical information. To address this issue, we propose **P**unctuation-aware **H**ybrid **S**parse **A**ttention (**PHSA**), a natively trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, (1) we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead; and (2) we introduce an extreme-sparsity-adaptive training and inference strategy that stabilizes model behavior under very low token activation ratios. Extensive experiments on general benchmarks and long-context evaluations demonstrate that PHSA consistently outperforms dense attention and state-of-the-art sparse attention baselines, including InfLLM v2. Specifically, for the 0.6B-parameter model with 32k-token input sequences, PHSA reduces information loss by 10.8% at a sparsity ratio of 97.3%.
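To make the dual-branch aggregation idea concrete, here is a minimal sketch of how a block's selection representation might fuse a global branch with a punctuation-anchored branch. The function name, the simple mean pooling, and the fixed fusion weight `alpha` are all illustrative assumptions; the paper's actual aggregation and gating may differ.

```python
import numpy as np

def block_representation(tokens, punct_mask, alpha=0.5):
    """Fuse a block's global summary with punctuation-anchored features.

    tokens:     (block_len, d) token embeddings for one block
    punct_mask: (block_len,) boolean array, True at punctuation tokens
    alpha:      fusion weight (hypothetical; stands in for a learned gate)
    """
    # Branch 1: coarse global semantics of the whole block.
    global_branch = tokens.mean(axis=0)
    # Branch 2: boundary features pooled only from punctuation tokens,
    # which act as semantic boundary anchors within the block.
    if punct_mask.any():
        boundary_branch = tokens[punct_mask].mean(axis=0)
    else:
        # No punctuation in this block: fall back to the global summary.
        boundary_branch = global_branch
    # Fuse the two branches into one representation used for block selection.
    return (1 - alpha) * global_branch + alpha * boundary_branch
```

During sparse block selection, a query would score each block against such fused representations and attend only to the top-k blocks, so sharper boundary features directly improve which blocks survive at high sparsity.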
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Commonsense Reasoning | HellaSwag | -- | 1460 |
| Code Generation | HumanEval | -- | 850 |
| Mathematical Reasoning | MATH | -- | 643 |
| Science Question Answering | ARC Challenge | Accuracy: 41.97 | 234 |
| Mathematical Reasoning | GSM8K | Math Score: 56.63 | 171 |
| Word Prediction | LAMBADA | Accuracy: 50.13 | 112 |
| Science Question Answering | ARC Easy | Accuracy: 72.69 | 101 |
| Mathematical Reasoning | MathQA | Accuracy: 42.91 | 95 |
| Complex Reasoning | BBH | Accuracy: 39.56 | 40 |
| Commonsense Reasoning | XStoryCloze | Average Score: 59.76 | 32 |