
SPLA: Block Sparse Plus Linear Attention for Long Context Modeling

About

Block-wise sparse attention offers significant efficiency gains for long-context modeling, yet existing methods often suffer from low selection fidelity and cumulative contextual loss by completely discarding unselected blocks. To address these limitations, we introduce Sparse Plus Linear Attention (SPLA), a framework that utilizes a selection metric derived from second-order Taylor expansions to accurately identify relevant blocks for exact attention. Instead of discarding the remaining "long tail," SPLA compresses unselected blocks into a compact recurrent state via a residual linear attention (RLA) module. Crucially, to avoid IO overhead, we derive an optimized subtraction-based formulation for RLA -- calculating the residual as the difference between global and selected linear attention -- ensuring that unselected blocks are never explicitly accessed during inference. Our experiments demonstrate that SPLA closes the performance gap in continual pretraining, surpassing dense attention models on long-context benchmarks like RULER while maintaining competitive general knowledge and reasoning capabilities.
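The subtraction-based residual described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the feature map, function names, and the per-token (rather than block-wise) masking are illustrative assumptions. It shows the key identity the abstract relies on: the linear-attention state over unselected positions equals the global state minus the state over selected positions, so unselected blocks never need to be gathered explicitly.

```python
import numpy as np

def phi(x):
    # ELU(x) + 1 feature map, a common choice in linear attention (assumption)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attn_state(K, V):
    # Recurrent state S = sum_i phi(k_i) v_i^T, shape (d_feat, d_v)
    return phi(K).T @ V

def residual_linear_attention(q, K, V, selected):
    # Subtraction-based residual: state over *unselected* positions is
    # (global state) - (selected state); unselected K/V are never indexed.
    S_res = linear_attn_state(K, V) - linear_attn_state(K[selected], V[selected])
    # Matching normalizer z = sum_i phi(k_i), also formed by subtraction
    z_res = phi(K).sum(axis=0) - phi(K[selected]).sum(axis=0)
    qf = phi(q)
    return (qf @ S_res) / (qf @ z_res + 1e-6)

rng = np.random.default_rng(0)
K, V = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
q = rng.normal(size=8)
selected = np.zeros(16, dtype=bool)
selected[:4] = True  # pretend the first "block" was picked for exact attention

out = residual_linear_attention(q, K, V, selected)
# Identical (up to float error) to linear attention computed directly
# over only the unselected positions:
direct = (phi(q) @ linear_attn_state(K[~selected], V[~selected])) / (
    phi(q) @ phi(K[~selected]).sum(axis=0) + 1e-6
)
assert np.allclose(out, direct)
```

Because linear attention's state is a plain sum over key-value pairs, the residual is exact, which is what lets SPLA avoid any IO over the unselected "long tail" at inference time.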

Bailin Wang, Dan Friedman, Tao Lei, Chong Wang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Code Generation | HumanEval | Pass@1 | 86.6 | 850 |
| Mathematical Reasoning | AIME 2025 | Pass@1 | 75.1 | 96 |
| Mathematical Reasoning | AIME 2024 | Pass@1 | 78.3 | 86 |
| Mathematical Reasoning | HMMT 2025 | -- | -- | 38 |
| General Knowledge Evaluation | General Knowledge Evaluation Suite (ARC, HellaSwag, LAMBADA, PIQA, SciQ, WinoGrande, TriviaQA, WebQS, MMLU, GSM8K) | ARC-C | 60.2 | 5 |
| Scientific Reasoning | GPQA Diamond | Pass@1 | 69.5 | 5 |
| Long-Context Retrieval | RULER | Retrieval Accuracy (4K Context) | 95.9 | 5 |
| Multitask Language Understanding | MMLU-Pro | Pass@1 | 0.793 | 5 |
| Programming Reasoning | LiveCodeBench v5 | Pass@1 | 62.4 | 5 |
