RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

About

Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache, an intermediate structure storing token representations, grows linearly with context length and requires an iterative linear scan for attention computation. A promising direction to accelerate long-context inference is to exploit attention's inherent sparsity by offloading the KV cache to CPU memory and retrieving only a small subset of tokens important to the current generation step. However, prior sparse attention approaches struggle to balance accuracy and retrieval cost due to varying sparsity patterns and inefficient GPU-CPU memory management. We present RetroInfer, a vector storage engine that realizes a sparsity-based KV cache for long-context inference. RetroInfer introduces an Attention-aWare VEctor index (wave index), which fundamentally improves the tradeoff between attention accuracy and retrieval cost through tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering. We also design the wave buffer, a GPU-CPU buffer manager that assigns computation and manages data across heterogeneous hardware. We evaluate RetroInfer across a range of models and workloads, demonstrating up to 4.4X decoding throughput over full attention at 120K context and up to 12.2X over sparse attention baselines at 1 million tokens -- all while preserving full-attention-level accuracy.
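The core idea of sparsity-based attention, attending only to the few cached tokens that matter for the current step, can be illustrated with a minimal sketch. This is not RetroInfer's wave index: it assumes exact top-k selection by dot product, which still scans every key to pick the subset, whereas a vector index aims to avoid that scan. The function names `attend` and `topk_sparse_attend` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Full attention for one query: a linear scan over every cached key."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def topk_sparse_attend(query, keys, values, k):
    """Attention restricted to the k keys with the largest dot product.

    Assumption for illustration: the top-k set is found exactly by scoring
    all keys. Softmax weights are renormalized over the selected subset.
    """
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    return attend(query, [keys[i] for i in top], [values[i] for i in top])


if __name__ == "__main__":
    random.seed(0)
    dim, n = 8, 256
    query = [random.gauss(0, 1) for _ in range(dim)]
    keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    # Plant a few keys aligned with the query so attention is sparse.
    for i in range(4):
        keys[i] = [3.0 * q for q in query]
    full = attend(query, keys, values)
    sparse = topk_sparse_attend(query, keys, values, k=16)
    err = max(abs(a - b) for a, b in zip(full, sparse))
    print(f"max abs difference with 16 of {n} tokens: {err:.6f}")
```

When a handful of keys dominate the softmax, the sparse output stays close to the full one while touching far fewer values. How closely it tracks depends on how peaked the attention distribution is, which is the accuracy versus retrieval cost tradeoff the abstract refers to.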

Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jing Liu, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Cheng Li, Yuqing Yang, Fan Yang, Mao Yang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Long-context Language Understanding | LongBench | -- | 294 |
| Long-context Understanding | RULER 32k | Accuracy 94.41 | 38 |
| Long-context Understanding | RULER 64k | Accuracy 92.37 | 37 |
| Reasoning | AIME 24 | -- | 30 |
| Long-context Language Understanding | RULER 16k context length | -- | 21 |
| Reasoning | GPQA | Avg@8 68 | 13 |
| Long-context Language Understanding | RULER 128k | Accuracy 89.49 | 10 |