Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
About
Personal AI agents incur substantial cost through repeated LLM calls, yet existing caching methods fail on real benchmarks: GPTCache achieves 37.9% accuracy and APC achieves 0-12%. The root cause is optimizing for the wrong property: cache effectiveness requires key consistency and precision, not classification accuracy. We observe that cache-key evaluation reduces to clustering evaluation and apply V-measure decomposition to separate these properties on n=8,682 points across MASSIVE, BANKING77, CLINC150, and NyayaBench v2, our new 8,514-entry multilingual agentic dataset (528 intents, 20 W5H2 classes, 63 languages). We introduce W5H2, a structured intent decomposition framework. Using SetFit with 8 examples per class, W5H2 achieves 91.1% ± 1.7% on MASSIVE in ~2 ms, versus 37.9% for GPTCache and 68.8% for a 20B-parameter LLM at 3,447 ms. On NyayaBench v2 (20 classes), SetFit achieves 55.3%, with cross-lingual transfer across 30 languages. Our five-tier cascade handles 85% of interactions locally, projecting a 97.5% cost reduction. We provide risk-controlled selective prediction guarantees via RCPS with nine bound families.
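The reduction of cache-key evaluation to clustering evaluation can be sketched with scikit-learn's V-measure decomposition. This is a minimal illustration with made-up toy data, not the paper's pipeline: each query's assigned cache key is treated as a cluster label and its gold intent as the class label, so homogeneity corresponds to key precision (a key serves only one intent) and completeness to key consistency (an intent always maps to the same key).

```python
# Toy sketch (assumed mapping, hypothetical data): cache keys as cluster
# assignments, gold intents as class labels.
from sklearn.metrics import homogeneity_completeness_v_measure

gold_intents = ["set_alarm", "set_alarm", "play_music", "play_music", "weather"]
cache_keys   = ["k1",        "k1",        "k2",         "k3",         "k2"]

# homogeneity (h): does each cache key cover only one intent? ("precision")
# completeness (c): does each intent always get the same cache key? ("consistency")
# v: harmonic mean of the two.
h, c, v = homogeneity_completeness_v_measure(gold_intents, cache_keys)
print(f"homogeneity={h:.3f} completeness={c:.3f} v_measure={v:.3f}")
```

Here key `k2` mixes two intents (hurting homogeneity) and `play_music` is split across `k2` and `k3` (hurting completeness), so both scores fall below 1.0; a cache whose keys matched intents exactly would score 1.0 on all three.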
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Intent Classification | Banking77 (test) | Accuracy | 82.6 | 151 |
| Intent Classification | CLINC150 (test) | Accuracy | 85.9 | 26 |
| Intent Classification | MASSIVE W5H2 | Cost/1K | 0.00e+0 | 7 |
| Intent Classification | NyayaBench v2 (test) | Accuracy | 62.6 | 6 |
| Intent Classification | MASSIVE W5H2 (test) | Accuracy | 84.4 | 4 |