
BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

About

The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the self-attention mechanism. To address this challenge, we introduce BLASST, a drop-in, dynamic sparse attention mechanism that accelerates inference by using only a fixed scalar threshold to skip attention blocks. Our method targets practical inference deployment by removing the barriers to adoption present in existing works. As such, BLASST eliminates training requirements, avoids expensive pre-computation passes, accelerates both prefill and decode across all major attention variants (MHA, GQA, MQA, and MLA), provides optimized support for modern hardware, and easily integrates into existing frameworks. This is achieved by reusing online softmax statistics to identify negligible attention scores, skipping softmax, value block loads, and the subsequent matrix multiplication. We demonstrate the BLASST algorithm by delivering optimized kernels with negligible latency overhead. Our automated threshold calibration procedure reveals a simple inverse relationship between optimal threshold and context length, meaning we require only a single threshold each for prefill and decode per model. Preserving benchmark accuracy, we demonstrate a 1.52x speedup for prefill at 71.9% sparsity and a 1.48x speedup for decode at 73.2% sparsity on modern GPUs.
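The block-skipping idea described above can be illustrated with a small NumPy sketch. It is not the authors' kernel. It assumes a single head, no causal mask, and one particular form of the skip test: a key/value block is dropped when, for every query row, the block's maximum score lies more than log(threshold) below the running row maximum that online softmax already tracks. The function name `blocked_attention` and its parameters `block` and `threshold` are hypothetical names chosen for this illustration.

```python
import numpy as np

def blocked_attention(q, k, v, block=16, threshold=None):
    """Single-head attention over KV blocks using online softmax.

    threshold is a hypothetical scalar in (0, 1). A block is skipped when
    every row's block-local max score is below the running row max plus
    log(threshold), i.e. every probability in it would be < threshold
    relative to the current row max. threshold=None disables skipping.
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n_q, -np.inf)            # running row max
    l = np.zeros(n_q)                    # running softmax denominator
    acc = np.zeros((n_q, v.shape[1]))    # running unnormalized output
    log_t = -np.inf if threshold is None else np.log(threshold)
    skipped = 0

    for start in range(0, k.shape[0], block):
        s = (q @ k[start:start + block].T) * scale   # block scores
        m_blk = s.max(axis=1)
        m_new = np.maximum(m, m_blk)

        # Skip test (assumed form): reuse the max statistics that online
        # softmax computes anyway. If it fires, m_new == m for every row,
        # so no state needs updating.
        if np.all(m_blk - m_new < log_t):
            skipped += 1
            continue  # no exp, no V block load, no P @ V matmul

        p = np.exp(s - m_new[:, None])
        alpha = np.exp(m - m_new)        # rescale previously accumulated state
        l = alpha * l + p.sum(axis=1)
        acc = alpha[:, None] * acc + p @ v[start:start + block]
        m = m_new

    return acc / l[:, None], skipped
```

Calling `blocked_attention(q, k, v, threshold=1e-3)` returns the attention output together with the number of skipped blocks; with `threshold=None` the function reduces to ordinary blocked online-softmax attention. In a real kernel the test would be evaluated per tile on the GPU, and the saving comes from not loading the value block at all.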

Jiayi Yuan, Cameron Shinn, Kai Xu, Jingze Cui, George Klimiashvili, Guangxuan Xiao, Perkz Zheng, Bo Li, Yuxin Zhou, Zhouhai Ye, Weijie You, Tian Zheng, Dominic Brown, Pengbo Wang, Markus Hoehnerbach, Richard Cai, Julien Demouth, John D. Owens, Xia Hu, Song Han, Timmy Liu, Huizi Mao • 2025

Related benchmarks

Task                        Dataset        Metric                   Result  Rank
Mathematical Reasoning      AIME 2024      Accuracy                 76.5    370
Science Reasoning           GPQA           Accuracy                 61.56   243
Long-context Reasoning      LongBench      Score                    35.1    62
Code Generation             LiveCodeBench  Accuracy                 54.15   60
Mathematical Reasoning      MATH500        Accuracy                 96.23   57
Graduate-Level Reasoning    GPQA           Accuracy                 61.51   41
Long-context retrieval      RULER          Retrieval Accuracy (8K)  94.7    34
Long-context Understanding  RULER 32k      Accuracy                 92.11   26
Long-context Understanding  LongBench      Overall Average Score    31.8    17
Long-context Understanding  LongBench      Accuracy                 33.9    4
Showing 10 of 12 rows
