BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

About

The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the self-attention mechanism. To address this challenge, we introduce BLASST, a drop-in, dynamic sparse attention mechanism that accelerates inference by using only a fixed scalar threshold to skip attention blocks. Our method targets practical inference deployment by removing the barriers to adoption present in existing works. As such, BLASST eliminates training requirements, avoids expensive pre-computation passes, accelerates both prefill and decode across all major attention variants (MHA, GQA, MQA, and MLA), provides optimized support for modern hardware, and easily integrates into existing frameworks. This is achieved by reusing online softmax statistics to identify negligible attention scores, skipping softmax, value block loads, and the subsequent matrix multiplication. We demonstrate the BLASST algorithm by delivering optimized kernels with negligible latency overhead. Our automated threshold calibration procedure reveals a simple inverse relationship between optimal threshold and context length, meaning we require only a single threshold each for prefill and decode per model. Preserving benchmark accuracy, we demonstrate a 1.52x speedup for prefill at 71.9% sparsity and a 1.48x speedup for decode at 73.2% sparsity on modern GPUs.
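The block-skipping idea in the abstract can be illustrated with a small, unoptimized sketch. It is not the authors' kernel. It handles a single query vector in pure Python, and the function name `blocked_attention`, its parameters, and the exact skip test (a block is skipped when its largest score sits more than -ln(threshold) below the running maximum already kept by online softmax) are assumptions made for illustration.

```python
import math

def blocked_attention(q, keys, values, block_size, threshold=0.0):
    """Single-query attention using online softmax over key/value blocks.

    Assumed skip rule: a block is dropped when
    exp(block_max - running_max) < threshold.
    threshold=0.0 disables skipping (dense attention).
    Returns (output_vector, number_of_skipped_blocks).
    """
    d_v = len(values[0])
    scale = 1.0 / math.sqrt(len(q))
    log_thr = math.log(threshold) if threshold > 0 else -math.inf

    m = -math.inf       # running max of all scores seen so far
    l = 0.0             # running softmax denominator
    acc = [0.0] * d_v   # running unnormalized output
    skipped = 0

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        # Scores (Q K^T) are still computed for every block.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        blk_max = max(scores)

        # Reuse the running max: if the whole block is negligible, skip the
        # exponentials, the value block load, and the P @ V product.
        if blk_max - m < log_thr:
            skipped += 1
            continue

        v_blk = values[start:start + block_size]
        m_new = max(m, blk_max)
        corr = math.exp(m - m_new)  # rescale old state; 0.0 on the first block
        l *= corr
        acc = [a * corr for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - m_new)
            l += p
            for j in range(d_v):
                acc[j] += p * v[j]
        m = m_new

    return [a / l for a in acc], skipped


# Toy data: the second block scores far below the first one.
q = [4.0, 0.0]
keys = [[4.0, 0.0], [3.0, 0.0], [-4.0, 0.0], [-3.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [7.0, 7.0]]

dense_out, dense_skipped = blocked_attention(q, keys, values, block_size=2)
sparse_out, sparse_skipped = blocked_attention(
    q, keys, values, block_size=2, threshold=1e-3
)
```

On this toy input the thresholded call skips the second block while its output stays within about 1e-6 of the dense result. The sketch only shows why the running maximum is enough to make the skip decision without an extra pass; real kernels work on tiles of queries on the GPU, which it does not attempt to model.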

Jiayi Yuan, Cameron Shinn, Kai Xu, Jingze Cui, George Klimiashvili, Guangxuan Xiao, Perkz Zheng, Bo Li, Yuxin Zhou, Zhouhai Ye, Weijie You, Tian Zheng, Dominic Brown, Pengbo Wang, Markus Hoehnerbach, Richard Cai, Julien Demouth, John D. Owens, Xia Hu, Song Han, Timmy Liu, Huizi Mao • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	AIME 2024	Accuracy: 76.5	370
Science Reasoning	GPQA	Accuracy: 61.56	243
Mathematical Reasoning	MATH500	Accuracy: 96.23	76
Code Generation	LiveCodeBench	Accuracy: 54.15	64
Long-context Reasoning	LongBench	Score: 35.1	62
Long-context Retrieval	RULER	Retrieval Accuracy (8K): 94.7	44
Graduate-Level Reasoning	GPQA	Accuracy: 61.51	44
Long-context Understanding	RULER 32k	Accuracy: 92.11	38
Long-context Understanding	LongBench	Overall Average Score: 31.8	17
Long-context Understanding	LongBench	Accuracy: 33.9	4

