
ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing

About

Large Reasoning Models (LRMs) are becoming integral to many AI inference systems, enhancing their capabilities with advanced reasoning. However, deploying these models in production environments presents a significant QoS challenge: the substantial memory overhead from their long, auto-regressive inference processes severely limits throughput and increases latency, thereby affecting the quality of service for concurrent users. We observe that LRMs frequently generate highly similar intermediate reasoning steps, which, in turn, correspond to highly similar KV cache states across layers. Building on this insight, we propose ReasonCache, a novel KV cache management approach designed to improve the QoS of AI inference systems. ReasonCache utilizes a Collaborative Filtering Algorithm to efficiently identify reusable KV cache blocks and enables zero-copy cache reuse. Experimental evaluation demonstrates that ReasonCache achieves a peak throughput improvement of 89.2% and an average gain of 40-60%, leading to more responsive and cost-effective AI inference services. Notably, this performance is achieved while maintaining higher accuracy compared to existing KV cache management techniques.
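The abstract does not spell out how reusable blocks are found or shared, so the following is only a toy sketch of the general idea of similarity-based KV block sharing, not the authors' method. It assumes a plain linear scan with cosine similarity in place of the paper's collaborative filtering algorithm, and it models "zero-copy" reuse by having requests store block ids that point into a shared pool instead of copying block contents. The names `BlockPool`, `put`, and `threshold` are hypothetical and chosen for illustration.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class BlockPool:
    """Hypothetical pool of physical KV blocks shared across requests."""
    threshold: float = 0.99                          # assumed similarity cutoff
    blocks: list = field(default_factory=list)       # physical block payloads
    refcount: list = field(default_factory=list)     # number of logical users per block

    def put(self, block):
        """Return a physical block id, reusing an existing block if it is similar enough."""
        for i, existing in enumerate(self.blocks):
            if cosine(existing, block) >= self.threshold:
                self.refcount[i] += 1    # share: no payload is copied
                return i
        self.blocks.append(block)        # no match: allocate a new physical block
        self.refcount.append(1)
        return len(self.blocks) - 1


pool = BlockPool()
# Each request keeps only a list of block ids (its logical-to-physical mapping).
req_a = [pool.put(b) for b in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])]
req_b = [pool.put(b) for b in ([1.0, 0.001, 0.0], [0.0, 0.0, 1.0])]
print(req_a, req_b, pool.refcount)
```

In this sketch the second request's first block is nearly identical to one already in the pool, so it maps to the same physical block and only three blocks are stored for four logical ones. A production system would need a much cheaper lookup than a linear scan and a policy for when approximate reuse is safe for accuracy.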

Kaiwen Chen, Xin Tan, Minchen Yu, Jingzong Li, Hong Xu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| Math | MATH 500 | Accuracy: 91.8 | 25 |
| Math | AIME 2024 | Accuracy: 80 | 15 |
| Summarization | TREC | Accuracy: 65.2 | 15 |
| Code Generation | HumanEval | Accuracy (HumanEval): 97.2 | 15 |
| Math | AIME 2025 | Accuracy: 63.3 | 15 |
| Multi-Doc QA | HotpotQA | Accuracy: 41 | 15 |
| Science | GPQA Diamond | Accuracy: 60.7 | 15 |
| Mathematics | AIME 2024 | Accuracy: 80 | 9 |
| Summarization | TREC | Accuracy: 65.2 | 9 |
| Code Generation | HumanEval | Accuracy: 97.2 | 9 |
Showing 10 of 14 rows
