FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

About

Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention mechanisms have been explored, they typically suffer from either significant search latency or insufficient sparsity. In this paper, we propose FlashPrefill, a framework enabling ultra-fast prefilling via instantaneous pattern discovery and thresholding. FlashPrefill leverages a fast block-searching technique to simultaneously locate dynamic vertical, slash, and block-sparse attention patterns. Crucially, it introduces a dynamic thresholding mechanism that bypasses the prohibitive overhead of sorting or accumulating attention scores while effectively eliminating the long-tail distribution to enhance sparsity. Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78x speedup on 256K sequences. Notably, unlike existing methods that incur efficiency degradation on shorter contexts, FlashPrefill maintains a 1.71x speedup even at a 4K context length, demonstrating its robustness and practical utility across varying sequence scales.
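The abstract does not spell out the thresholding rule, so the following is only a minimal sketch of the general idea of choosing key blocks by a threshold instead of by sorting or cumulative sums. It assumes mean-pooled block scores and a hypothetical max-relative rule: keep a key block when its softmax-weight proxy is at least `alpha` times that of the best block in the same row. The function names, the pooling choice, and `alpha` are illustrative assumptions, not the authors' method.

```python
import math
import random


def mean_pool_blocks(x, block_size):
    """Average consecutive rows of x (a list of vectors) into one vector per block."""
    pooled = []
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        dim = len(block[0])
        pooled.append([sum(row[d] for row in block) / len(block) for d in range(dim)])
    return pooled


def select_blocks_by_threshold(q, k, block_size, alpha):
    """Return, for each query block, the causal key-block indices that pass the threshold.

    Assumed rule (hypothetical): key block j is kept for query block i when
    exp(s_ij - max_j s_ij) >= alpha. This needs one max and one comparison
    pass per row, with no sorting and no cumulative sum of scores.
    """
    qb = mean_pool_blocks(q, block_size)
    kb = mean_pool_blocks(k, block_size)
    scale = 1.0 / math.sqrt(len(q[0]))
    kept = []
    for i, qv in enumerate(qb):
        # Causal masking at block granularity: only key blocks j <= i are scored.
        scores = [scale * sum(a * b for a, b in zip(qv, kb[j])) for j in range(i + 1)]
        row_max = max(scores)
        kept.append([j for j, s in enumerate(scores) if math.exp(s - row_max) >= alpha])
    return kept


# Toy usage on random data (sizes chosen only for illustration).
random.seed(0)
seq_len, dim, block_size = 64, 8, 16
q = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(seq_len)]
k = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(seq_len)]
kept = select_blocks_by_threshold(q, k, block_size, alpha=0.5)
keep_all = select_blocks_by_threshold(q, k, block_size, alpha=0.0)
```

With `alpha` between 0 and 1, the highest-scoring block in each row always survives, and raising `alpha` prunes more of the low-score tail. A top-k or cumulative-mass rule would instead need a sort or a running sum per row, which is the overhead the abstract says FlashPrefill avoids.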

Qihang Fan, Huaibo Huang, Zhiying Wu, Juqiu Wang, Bingning Wang, Ran He • 2026

Related benchmarks

Task	Dataset	Metric	Result
Video Understanding	VideoMME	Score (Overall)	72	357
Long-context Language Understanding	InfiniteBench	En.Sum	33.01	88
Long-context language modeling evaluation	RULER	Score (4K)	97.27	49
Latency Measurement	LLaMA-8B-Instruct Chunked Prefill 3.1 (inference)	Attention Latency (ms)	464.5	49
Long-context Understanding	RULER 32k	Accuracy	91.55	38
Long-context Understanding	RULER 64k	Accuracy	87.73	37
End-to-end Time-to-First-Token (TTFT)	Long-context sequences	TTFT (ms)	257	36
Long-context Understanding	RULER 128k	Accuracy	86.77	27
Long-context Understanding	RULER	--	--	27
Prefill Latency Measurement	LLaMA 8B Instruct 16K context length 3.1	Attention Prefill Latency	1.36e+3	7

Showing 10 of 16 rows
