Make Your LVLM KV Cache More Lightweight

About

Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage. To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings. Guided by text prompts, LightKV employs cross-modality message passing to aggregate informative messages across vision tokens and progressively compress them during prefill. This prompt-aware guidance distinguishes our method from prior vision-only compression strategies. We evaluate LightKV on eight open-source LVLMs across eight public benchmark datasets, e.g., MME and SeedBench. Experimental results demonstrate that with only 55% of the original vision tokens, LightKV (a) halves the vision-token KV cache size, (b) reduces computation by up to 40%, and (c) preserves general-purpose performance while significantly outperforming existing baselines.

Xihao Chen, Yangyang Guo, Roger Zimmermann • 2026
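
To make the idea in the abstract concrete, here is a minimal sketch of prompt-guided vision-token compression. It is not the LightKV algorithm, whose message-passing details are not given on this page. The sketch assumes a simpler stand-in: each vision token is scored by cosine similarity to the mean prompt embedding, the top `keep_ratio` fraction is kept, and every dropped token is averaged into its most similar kept token. The names `compress_vision_tokens`, `cosine`, and `keep_ratio` are hypothetical and chosen for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compress_vision_tokens(vision, prompt, keep_ratio=0.55):
    """Keep the vision tokens most similar to the prompt and merge the rest.

    Hypothetical illustration, NOT the paper's method: relevance is cosine
    similarity to the mean prompt embedding, and each dropped token is
    averaged into its most similar kept token.
    """
    dim = len(prompt[0])
    # The mean prompt embedding acts as the text-side guidance signal.
    query = [sum(tok[d] for tok in prompt) / len(prompt) for d in range(dim)]

    n_keep = max(1, round(keep_ratio * len(vision)))
    ranked = sorted(
        range(len(vision)),
        key=lambda i: cosine(vision[i], query),
        reverse=True,
    )
    kept = sorted(ranked[:n_keep])  # preserve the original token order
    dropped = ranked[n_keep:]

    # Track running sums and counts so each kept token ends up as a group mean.
    sums = {i: list(vision[i]) for i in kept}
    counts = {i: 1 for i in kept}
    for j in dropped:
        target = max(kept, key=lambda i: cosine(vision[i], vision[j]))
        sums[target] = [s + v for s, v in zip(sums[target], vision[j])]
        counts[target] += 1

    return [[s / counts[i] for s in sums[i]] for i in kept]


# Toy usage: four 2-D "vision tokens" and a one-token "prompt".
vision = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
prompt = [[1.0, 0.0]]
compressed = compress_vision_tokens(vision, prompt, keep_ratio=0.5)
print(len(vision), "->", len(compressed), "tokens")
```

Because the KV cache stores one key and one value vector per token per layer, its vision-token portion shrinks roughly in proportion to the number of tokens retained. In this toy version the compression happens once, whereas the abstract describes compression applied progressively during prefill.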

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Visual Question Answering	GQA	GQA Score 63	152
Visual Question Answering	VizWiz	VW Score 69.4	25
Science Question Answering	SQA	SQA Score 97	22
Multimodal Evaluation	COCO	Average Percentage 99.94	16
Multi-modal Evaluation	MME	Cognition Score (C) 647.5	12
Multimodal Evaluation	MME	MME-C Score 590	11
Image Captioning	COCO	COCO Score 91	11
Image Captioning	NoCaps	NC Score 43.5	6
Image Captioning	MS-COCO	COCO Score 38.9	6

Showing 10 of 15 rows
