
Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling

About

Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware-efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward semantic fixing points, making further processing redundant. To this end, we introduce Delta Attention Selective Halting (DASH), a training-free policy that monitors the layer-wise update dynamics of the self-attention mechanism to selectively halt stabilized tokens. Extensive evaluation confirms that DASH generalizes across language and vision benchmarks, delivering significant prefill speedups while preserving model accuracy and hardware efficiency. Code will be released at https://github.com/verach3n/DASH.git.
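
The abstract does not state the exact halting criterion, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that a token's stability is measured as the relative L2 change of its self-attention output between two consecutive layers, and that a token is halted once that change falls below a cutoff. The function names, the dict-based layout, and the `threshold` value are all hypothetical.

```python
import math


def relative_delta(prev_vec, curr_vec, eps=1e-8):
    """Relative L2 change of one token's attention output across two layers."""
    diff = math.sqrt(sum((c - p) ** 2 for p, c in zip(prev_vec, curr_vec)))
    norm = math.sqrt(sum(p * p for p in prev_vec))
    return diff / (norm + eps)


def update_active_tokens(prev_attn, curr_attn, active, threshold=0.05):
    """Return the token indices that stay active after the current layer.

    prev_attn, curr_attn: dict mapping token index -> attention-output vector
                          at the previous and current layer.
    active:               list of token indices still being processed.
    threshold:            hypothetical stability cutoff (an assumption,
                          not a value taken from the paper).
    """
    still_active = []
    for i in active:
        # A token whose update is below the cutoff is treated as stabilized
        # and is dropped from further processing.
        if relative_delta(prev_attn[i], curr_attn[i]) >= threshold:
            still_active.append(i)
    return still_active


# Toy example with three tokens and 2-dimensional attention outputs.
prev = {0: [1.0, 0.0], 1: [1.0, 1.0], 2: [0.5, 0.5]}
curr = {0: [1.0, 0.01], 1: [0.0, 1.0], 2: [0.5, 0.5]}
active = update_active_tokens(prev, curr, [0, 1, 2])
print(active)
```

In this toy input, tokens 0 and 2 barely change between layers and are halted, while token 1 changes substantially and stays active. Returning a plain list of surviving indices means the remaining tokens can be gathered into a shorter contiguous sequence before the next layer.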

Yujie Chen, Tailai Chen, Yifeng Gao, Zoe Wanying He, Yijue Xu, Shaobo Wang, Linfeng Zhang • 2026

Related benchmarks

Task                             Dataset    Result          Rank
Object Hallucination Evaluation  POPE       Accuracy 88.93  2019
Visual Question Answering        VizWiz     Accuracy 61.46  1820
Visual Question Answering        TextVQA    --              1453
Visual Question Answering        GQA        Accuracy 59.8   1425
Multimodal Understanding         MMBench    --              847
Visual Question Answering        ChartQA    Accuracy 62.6   519
Optical Character Recognition    OCRBench   Score 44.6      433
OCR Evaluation                   OCRBench   Score 60        350
Visual Question Answering        AI2D       Accuracy 70.27  317
Multimodal Understanding         SEED       Accuracy 70.9   216
Showing 10 of 29 rows
