
ContextPilot: Fast Long-Context Inference via Context Reuse

About

AI applications increasingly depend on long-context inference, where LLMs consume substantial context to support stronger reasoning. Common examples include retrieval-augmented generation, agent memory layers, and multi-agent orchestration. As input contexts grow longer, prefill latency becomes the main bottleneck. Yet today's prefill acceleration techniques face a trade-off: they either preserve reasoning quality but deliver little KV-cache reuse, or improve reuse at the cost of degraded reasoning quality. We present ContextPilot, a system that accelerates prefill by introducing context reuse as a new mechanism for faster long-context inference. ContextPilot builds a context index to identify overlapping context blocks across LLM interactions (e.g., across users and turns), and applies context-ordering and de-duplication techniques to maximize KV-cache reuse. To preserve reasoning quality under reuse, it adds succinct context annotations that prevent quality degradation. Finally, ContextPilot is built around a modular architecture with a clean interface that integrates with existing inference engines. Extensive evaluation shows that ContextPilot reduces LLM prefill latency by up to $3\times$ compared with state-of-the-art methods while preserving reasoning quality. At longer context lengths, it can even improve reasoning quality. ContextPilot is open-sourced at: https://github.com/EfficientContext/ContextPilot.
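To make the context-reuse idea concrete, here is a minimal sketch of a content-addressed context index with de-duplication and reuse-first ordering. All names (`ContextIndex`, `plan`, the 256-token block size) are illustrative assumptions, not ContextPilot's actual API; the cached KV entries are simulated with placeholders.

```python
import hashlib

BLOCK_SIZE = 256  # tokens per block (hypothetical granularity)


def split_blocks(tokens, block_size=BLOCK_SIZE):
    """Split a token sequence into fixed-size blocks."""
    return [tuple(tokens[i:i + block_size])
            for i in range(0, len(tokens), block_size)]


def block_id(block):
    """Content hash identifying a block across users and turns."""
    return hashlib.sha256(repr(block).encode()).hexdigest()


class ContextIndex:
    """Toy index mapping block hashes to (simulated) cached KV entries."""

    def __init__(self):
        self.cache = {}  # block_id -> placeholder for a cached KV entry

    def plan(self, tokens):
        """De-duplicate blocks within a request and order already-cached
        blocks first, so shared context forms a reusable prefix."""
        seen, blocks = set(), []
        for b in split_blocks(tokens):
            bid = block_id(b)
            if bid in seen:          # de-duplication within the request
                continue
            seen.add(bid)
            blocks.append((bid, b))
        # Ordering: place blocks already in the cache before new ones.
        blocks.sort(key=lambda x: x[0] not in self.cache)
        hits = sum(1 for bid, _ in blocks if bid in self.cache)
        for bid, _ in blocks:        # simulate computing + caching KV
            self.cache.setdefault(bid, "KV")
        return [bid for bid, _ in blocks], hits
```

For example, if two requests share a 512-token document prefix, the first request populates the cache and the second gets both document blocks as cache hits, skipping their prefill.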

Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun, Luo Mai · 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | Multi-hop RAG | F1 | 64.68 | 65 |
| Hybrid Retrieval-Augmented Generation | Hybrid RAG | TTFT (s) | 0.24 | 20 |
| Multi-session Retrieval-Augmented Generation | MultiHopRAG (test) | F1 Score | 64.4 | 12 |
| Multi-session Retrieval-Augmented Generation | NarrativeQA (test) | F1 Score | 38.4 | 12 |
| Multi-session Retrieval-Augmented Generation | QASPER (test) | F1 Score | 34.9 | 12 |
| Multi-turn Retrieval-Augmented Generation | MT-RAG | Accuracy | 75.81 | 11 |
| Question Answering | NarrativeQA | Prefill Throughput (tok/s) | 2.47e+4 | 6 |
