
TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

About

Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (KV) cache size required by Transformer architectures intensifies the memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the most relevant tokens for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection. This paper introduces TidalDecode, a simple yet effective algorithm and system for fast and accurate LLM decoding through position persistent sparse attention. TidalDecode leverages the spatial coherence of tokens selected by existing sparse attention methods and introduces a few token selection layers that perform full attention to identify the tokens with the highest attention scores, while all other layers perform sparse attention with the pre-selected tokens. This design enables TidalDecode to substantially reduce the overhead of token selection for sparse attention without sacrificing the quality of the generated results. Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely matches the generative performance of full attention methods while reducing the LLM decoding latency by up to 2.1x.
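The core mechanism described above can be sketched in a few lines: designated token selection layers run full attention over the KV cache and record the top-scoring token positions, while every other layer reuses those positions for sparse attention. The sketch below is a hypothetical single-head NumPy illustration, not the paper's implementation; the function name `decode_step`, the per-layer input layout, and the `selection_layers`/`k_top` parameters are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_step(q_per_layer, k_cache, v_cache, selection_layers, k_top):
    """One decoding step with position persistent sparse attention (sketch).

    q_per_layer[l] is the single-head query vector for layer l; k_cache[l]
    and v_cache[l] are that layer's cached keys/values (seq_len x head_dim).
    Layers in `selection_layers` run full attention and re-select the k_top
    highest-scoring token positions; all other layers attend only over the
    most recently selected positions (the "position persistent" part).
    """
    selected = None  # positions shared by subsequent sparse-attention layers
    outputs = []
    for l, (q, K, V) in enumerate(zip(q_per_layer, k_cache, v_cache)):
        d = K.shape[-1]
        if l in selection_layers or selected is None:
            scores = q @ K.T / np.sqrt(d)            # full attention scores
            selected = np.argsort(scores)[-k_top:]   # keep top-k positions
            out = softmax(scores) @ V
        else:
            Ks, Vs = K[selected], V[selected]        # sparse: pre-selected tokens
            scores = q @ Ks.T / np.sqrt(d)
            out = softmax(scores) @ Vs
        outputs.append(out)
    return outputs, selected
```

Because only a few layers pay the full-attention cost of re-ranking the cache, the per-step token selection overhead stays small while the sparse layers still attend to tokens that were recently verified as high-scoring.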

Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | Minerva | - | 138 |
| Mathematical Reasoning | AIME 24 | Accuracy: 33.3 | 84 |
| Math Reasoning | OlympiadBench | Accuracy: 10.9 | 54 |
| Needle-In-A-Haystack Retrieval | Needle-in-a-Haystack 8K context (test) | Accuracy: 100 | 30 |
| Needle-In-A-Haystack Retrieval | Needle-in-a-Haystack 32K context (test) | Accuracy: 61 | 30 |
| Long-context Understanding | LongBench | MFQA: 30.94 | 18 |
| Math Reasoning | GaoKao En 2023 | Accuracy: 63.3 | 16 |
