ContextPilot: Fast Long-Context Inference via Context Reuse
About
AI applications increasingly depend on long-context inference, where LLMs consume substantial context to support stronger reasoning. Common examples include retrieval-augmented generation, agent memory layers, and multi-agent orchestration. As input contexts grow longer, prefill latency becomes the main bottleneck. Yet today's prefill acceleration techniques face a trade-off: they either preserve reasoning quality but deliver little KV-cache reuse, or improve reuse at the cost of degraded reasoning quality. We present ContextPilot, a system that accelerates prefill by introducing context reuse as a new mechanism for fast long-context inference. ContextPilot builds a context index to identify overlapping context blocks across LLM interactions (e.g., across users and turns), and applies context ordering and de-duplication techniques to maximize KV-cache reuse. To preserve reasoning quality under reuse, it adds succinct context annotations that prevent quality degradation. Finally, ContextPilot is built around a modular architecture with a clean interface that integrates with existing inference engines. Extensive evaluation shows that ContextPilot reduces LLM prefill latency by up to $3\times{}$ compared to state-of-the-art methods while preserving reasoning quality; at longer context lengths, it can even improve reasoning quality. ContextPilot is open-sourced at: https://github.com/EfficientContext/ContextPilot.
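To make the core ideas concrete, here is a minimal, self-contained sketch of what a context index with ordering and de-duplication could look like. This is an illustration under assumptions, not ContextPilot's actual implementation: the class `ContextIndex`, the hashing scheme, and the most-shared-blocks-first ordering heuristic are all hypothetical stand-ins for the paper's techniques. The key intuition it captures is that prefix KV caches only hit on identical leading tokens, so placing the blocks shared by many requests at the front of the prompt maximizes reuse.

```python
import hashlib


def block_id(chunk: str) -> str:
    """Stable content-hash identifier for a context block (illustrative)."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]


class ContextIndex:
    """Toy context index: tracks how often each block appears across requests.

    Hypothetical sketch -- not ContextPilot's real data structure.
    """

    def __init__(self) -> None:
        self.freq: dict[str, int] = {}

    def add_request(self, chunks: list[str]) -> None:
        # Register each block of an incoming request so overlap across
        # users/turns can be detected by id equality.
        for c in chunks:
            bid = block_id(c)
            self.freq[bid] = self.freq.get(bid, 0) + 1

    def order_for_reuse(self, chunks: list[str]) -> list[str]:
        # 1) De-duplicate repeated blocks within a single request.
        seen: set[str] = set()
        unique: list[str] = []
        for c in chunks:
            bid = block_id(c)
            if bid not in seen:
                seen.add(bid)
                unique.append(c)
        # 2) Order blocks so the most widely shared ones come first,
        #    steering requests toward a common cached prefix.
        #    (A simple heuristic; real ordering must also respect
        #    reasoning quality, which the paper's annotations address.)
        unique.sort(key=lambda c: -self.freq.get(block_id(c), 0))
        return unique


if __name__ == "__main__":
    idx = ContextIndex()
    idx.add_request(["doc A", "doc B"])   # request 1
    idx.add_request(["doc A", "doc C"])   # request 2: shares "doc A"
    ordered = idx.order_for_reuse(["doc C", "doc A", "doc A"])
    print(ordered)  # shared block "doc A" first, duplicate removed
```

Note that naively reordering context can change model behavior, which is exactly the quality risk the paper's context annotations are designed to counter; this sketch only shows the reuse side of the trade-off.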
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | Multi-hop RAG | F1 | 64.68 | 65 |
| Hybrid Retrieval-Augmented Generation | Hybrid RAG | TTFT (s) | 0.24 | 20 |
| Multi-session Retrieval-Augmented Generation | MultiHopRAG (test) | F1 Score | 64.4 | 12 |
| Multi-session Retrieval-Augmented Generation | NarrativeQA (test) | F1 Score | 38.4 | 12 |
| Multi-session Retrieval-Augmented Generation | QASPER (test) | F1 Score | 34.9 | 12 |
| Multi-turn Retrieval-Augmented Generation | MT-RAG | Accuracy | 75.81 | 11 |
| Question Answering | NarrativeQA | Prefill Throughput (tok/s) | 2.47e+4 | 6 |