BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
About
The growing demand for long-context inference in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the self-attention mechanism. To address this challenge, we introduce BLASST, a drop-in, dynamic sparse attention mechanism that accelerates inference by using only a fixed scalar threshold to skip attention blocks. Our method targets practical inference deployment by removing the barriers to adoption present in existing works. BLASST eliminates training requirements, avoids expensive pre-computation passes, accelerates both prefill and decode across all major attention variants (MHA, GQA, MQA, and MLA), provides optimized support for modern hardware, and integrates easily into existing frameworks. This is achieved by reusing online softmax statistics to identify negligible attention scores, skipping the softmax, the value block loads, and the subsequent matrix multiplication. We realize BLASST in optimized kernels that add negligible latency overhead. Our automated threshold calibration procedure reveals a simple inverse relationship between the optimal threshold and context length, so only a single threshold each is required for prefill and decode per model. While preserving benchmark accuracy, we demonstrate a 1.52x speedup for prefill at 71.9% sparsity and a 1.48x speedup for decode at 73.2% sparsity on modern GPUs.
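To illustrate the core idea, the sketch below shows a single-query blocked attention pass in NumPy. During the online-softmax scan, every score in a key/value block is bounded by that block's maximum, so each of its softmax weights is at most `exp(block_max - running_max)`; when that bound falls below the threshold, the block's softmax, value load, and matmul are skipped. This is a minimal illustration under our own assumptions, not the paper's kernel: the function name, block size, and skip test are ours, and real kernels operate on GPU tiles rather than NumPy slices.

```python
import numpy as np

def blasst_attention(q, K, V, threshold=1e-3, block_size=4):
    """Single-query blocked attention with threshold-based block skipping.

    Illustrative sketch: reuses the online-softmax running max to bound
    each block's softmax weights and skip blocks whose bound is below
    `threshold`. Returns the attention output and the fraction skipped.
    """
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    running_max = -np.inf   # online-softmax running maximum
    running_sum = 0.0       # online-softmax running denominator
    acc = np.zeros(d)       # unnormalized output accumulator
    skipped, n_blocks = 0, 0
    for start in range(0, K.shape[0], block_size):
        n_blocks += 1
        Kb = K[start:start + block_size]
        s = (Kb @ q) * scale            # raw scores for this block
        block_max = s.max()
        # Every weight in this block is <= exp(block_max - running_max);
        # if that bound is negligible, skip softmax, V load, and matmul.
        if np.exp(block_max - running_max) < threshold:
            skipped += 1
            continue
        new_max = max(running_max, block_max)
        correction = np.exp(running_max - new_max)  # rescale old state
        p = np.exp(s - new_max)                     # block softmax numerators
        acc = acc * correction + p @ V[start:start + block_size]
        running_sum = running_sum * correction + p.sum()
        running_max = new_max
    return acc / running_sum, skipped / n_blocks
```

With `threshold=0.0` no block is ever skipped and the loop reduces to exact online-softmax attention; raising the threshold trades a bounded amount of softmax mass for skipped work.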
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 2024 | Accuracy | 76.5 | 370 |
| Science Reasoning | GPQA | Accuracy | 61.56 | 243 |
| Long-context Reasoning | LongBench | Score | 35.1 | 62 |
| Code Generation | LiveCodeBench | Accuracy | 54.15 | 60 |
| Mathematical Reasoning | MATH500 | Accuracy | 96.23 | 57 |
| Graduate-Level Reasoning | GPQA | Accuracy | 61.51 | 41 |
| Long-context Retrieval | RULER | Retrieval Accuracy (8K) | 94.7 | 34 |
| Long-context Understanding | RULER 32k | Accuracy | 92.11 | 26 |
| Long-context Understanding | LongBench | Overall Average Score | 31.8 | 17 |
| Long-context Understanding | LongBench | Accuracy | 33.9 | 4 |