MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference

About

Long-context Multimodal Large Language Models (MLLMs) that incorporate long text-image and text-video modalities demand substantial resources, because their multimodal Key-Value (KV) caches grow with input length, which challenges inference efficiency. Existing KV cache compression methods, for both text-only and multimodal LLMs, have neglected variations in attention density across layers and therefore often adopt uniform or progressive reduction strategies for layer-wise cache allocation. In this work, we propose MEDA, a dynamic layer-wise KV cache allocation method for efficient multimodal long-context inference. At its core, MEDA uses cross-modal attention entropy to determine the KV cache size at each MLLM layer. Given the dynamically allocated KV cache size at each layer, MEDA also employs a KV pair selection scheme to identify which KV pairs to keep, and a KV pair merging strategy that merges the selected and non-selected pairs to preserve information from the entire context. MEDA achieves up to 72% KV cache memory reduction and 2.82 times faster decoding, while maintaining or improving performance on various multimodal tasks in long-context settings, including multi-image and long-video scenarios. Our code is released at https://github.com/AIoT-MLSys-Lab/MEDA.
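The three steps named in the abstract (entropy-based layer budgets, KV pair selection, KV pair merging) can be illustrated with a minimal pure-Python sketch. This is not the released MEDA implementation: the proportional budget rule, the top-score selection, the cosine-similarity merge target, and all function and parameter names (`attention_entropy`, `allocate_budgets`, `select_and_merge`, `min_per_layer`) are assumptions chosen only to make the idea concrete.

```python
import math


def attention_entropy(attn_rows):
    """Mean Shannon entropy (nats) of attention rows, each normalized to sum to 1."""
    total = 0.0
    for row in attn_rows:
        s = sum(row)
        total += -sum((w / s) * math.log(w / s) for w in row if w > 0)
    return total / len(attn_rows)


def allocate_budgets(layer_entropies, total_budget, min_per_layer=1):
    """Split a total KV budget across layers in proportion to entropy.

    Assumption: higher-entropy (more diffuse) layers get more cache slots.
    """
    n = len(layer_entropies)
    spare = total_budget - min_per_layer * n
    if spare < 0:
        raise ValueError("total_budget is too small for min_per_layer")
    z = sum(layer_entropies)
    shares = [e / z for e in layer_entropies] if z > 0 else [1.0 / n] * n
    budgets = [min_per_layer + int(spare * s) for s in shares]
    # Hand slots lost to rounding down to the highest-entropy layers first.
    leftover = max(total_budget - sum(budgets), 0)
    by_entropy = sorted(range(n), key=lambda i: layer_entropies[i], reverse=True)
    for i in by_entropy[:leftover]:
        budgets[i] += 1
    return budgets


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_and_merge(keys, values, scores, budget):
    """Keep the `budget` highest-scoring KV pairs and fold each dropped pair
    into its most similar kept pair by averaging (hypothetical merge rule)."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:budget])  # preserve original token order
    groups = {i: [i] for i in kept}
    for j in order[budget:]:
        nearest = max(kept, key=lambda i: cosine(keys[i], keys[j]))
        groups[nearest].append(j)

    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    new_keys = [mean([keys[j] for j in groups[i]]) for i in kept]
    new_values = [mean([values[j] for j in groups[i]]) for i in kept]
    return kept, new_keys, new_values
```

In this sketch, a caller would compute one entropy per layer from that layer's attention rows, pass the list to `allocate_budgets`, and then call `select_and_merge` per layer with the resulting budget. Averaging dropped pairs into a kept neighbour is one simple way to retain some information from the full context; the paper's actual selection and merging rules may differ.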

Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang • 2025

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled in source)
Visual Question Answering	TextVQA	Accuracy	60.8	1455
Mathematical Reasoning	MathVista	Accuracy	51.6	382
Massive Multi-discipline Multimodal Understanding	MMMU	Accuracy	53.18	249
Document Visual Question Answering	DocVQA	Accuracy	60.52	203
Multimodal Evaluation	MMStar	Accuracy	62.41	177
Text-to-Image Generation	HPS v2.1	Overall Score	29.57	153
Text-to-Image Generation	ImageReward	ImageReward Score	0.954	119
Mathematical Visual Question Answering	MathVista	Accuracy	61.8	87
Instruction Following	ALFRED	Accuracy	15.98	57
Multi-modal Long-context Benchmarking	MileBench	Task T Score	54.24	39

Showing 10 of 17 rows
