
ContextPilot: Fast Long-Context Inference via Context Reuse

About

AI applications increasingly depend on long-context inference, where LLMs consume substantial context to support stronger reasoning. Common examples include retrieval-augmented generation, agent memory layers, and multi-agent orchestration. As input contexts grow longer, prefill latency becomes the main bottleneck. Yet today's prefill acceleration techniques face a trade-off: they either preserve reasoning quality but deliver little KV-cache reuse, or improve reuse at the cost of degraded reasoning quality. We present ContextPilot, a system that accelerates prefill by introducing context reuse as a new mechanism for faster long-context inference. ContextPilot builds a context index to identify overlapping context blocks across LLM interactions (e.g., across users and turns), and applies context-ordering and de-duplication techniques to maximize KV-cache reuse. To preserve reasoning quality under reuse, it adds succinct context annotations that prevent quality degradation. Finally, ContextPilot is built around a modular architecture with a clean interface that integrates with existing inference engines. Extensive evaluation shows that ContextPilot reduces LLM prefill latency by up to $3\times$ compared with state-of-the-art methods while preserving reasoning quality. At longer context lengths, it can even improve reasoning quality. ContextPilot is open-sourced at: https://github.com/EfficientContext/ContextPilot.
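To make the context-reuse idea concrete, here is a minimal sketch of a content-addressed context index with de-duplication and reuse-first ordering. All names (`ContextIndex`, `plan`, the 256-token block size) are illustrative assumptions, not ContextPilot's actual API; the cached KV entries are simulated with placeholders.

```python
import hashlib

BLOCK_SIZE = 256  # tokens per block (hypothetical granularity)


def split_blocks(tokens, block_size=BLOCK_SIZE):
    """Split a token sequence into fixed-size blocks."""
    return [tuple(tokens[i:i + block_size])
            for i in range(0, len(tokens), block_size)]


def block_id(block):
    """Content hash identifying a block across users and turns."""
    return hashlib.sha256(repr(block).encode()).hexdigest()


class ContextIndex:
    """Toy index mapping block hashes to (simulated) cached KV entries."""

    def __init__(self):
        self.cache = {}  # block_id -> placeholder for a cached KV entry

    def plan(self, tokens):
        """De-duplicate blocks within a request and order already-cached
        blocks first, so shared context forms a reusable prefix."""
        seen, blocks = set(), []
        for b in split_blocks(tokens):
            bid = block_id(b)
            if bid in seen:          # de-duplication within the request
                continue
            seen.add(bid)
            blocks.append((bid, b))
        # Ordering: place blocks already in the cache before new ones.
        blocks.sort(key=lambda x: x[0] not in self.cache)
        hits = sum(1 for bid, _ in blocks if bid in self.cache)
        for bid, _ in blocks:        # simulate computing + caching KV
            self.cache.setdefault(bid, "KV")
        return [bid for bid, _ in blocks], hits
```

For example, if two requests share a 512-token document prefix, the first request populates the cache and the second gets both document blocks as cache hits, skipping their prefill.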

Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun, Luo Mai · 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | Multi-hop RAG | F1 | 64.68 | 65 |
| Hybrid Retrieval-Augmented Generation | Hybrid RAG | TTFT (s) | 0.24 | 20 |
| Multi-session Retrieval-Augmented Generation | MultiHopRAG (test) | F1 Score | 64.4 | 12 |
| Multi-session Retrieval-Augmented Generation | NarrativeQA (test) | F1 Score | 38.4 | 12 |
| Multi-session Retrieval-Augmented Generation | QASPER (test) | F1 Score | 34.9 | 12 |
| Multi-turn Retrieval-Augmented Generation | MT-RAG | Accuracy | 75.81 | 11 |
| Question Answering | NarrativeQA | Prefill Throughput (tok/s) | 2.47e+4 | 6 |
