MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs

About

For multimodal large language models (MLLMs), visual information is relatively sparse compared with text. As a result, visual token pruning has emerged as a line of research for efficient inference. Current approaches typically measure token importance using attention scores in the visual encoder or in the LLM decoder, then keep visual tokens with high attention scores and prune the rest. In this paper, we pursue a different and more surgical approach. Instead of relying on mechanism-specific signals, we directly compute Mutual Information (MI) between the visual and textual features themselves, prior to their interaction. This allows us to explicitly measure crossmodal dependency at the feature level. Our MI-Pruner is simple, efficient and non-intrusive, requiring no access to internal attention maps or architectural modifications. Experimental results demonstrate that our approach outperforms previous attention-based pruning methods while adding minimal latency.
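To make the idea of scoring visual tokens by crossmodal dependency concrete, here is a minimal sketch. The abstract does not say which MI estimator MI-Pruner uses, so everything below is an assumption made for illustration: the joint distribution is built from a softmax over cosine similarities with a hypothetical temperature `tau`, and the function names `mi_token_scores` and `prune_visual_tokens` are invented here. It should not be read as the authors' implementation.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def mi_token_scores(visual, text, tau=0.1):
    """Per-visual-token contribution to an MI estimate (illustrative only).

    ASSUMPTION: the joint p(i, j) over (visual token i, text token j) is
    taken proportional to exp(cos(v_i, t_j) / tau). The paper's actual
    estimator is not given in the abstract.
    """
    weights = [[math.exp(_cosine(v, t) / tau) for t in text] for v in visual]
    total = sum(sum(row) for row in weights)
    joint = [[w / total for w in row] for row in weights]

    # Marginals over visual tokens (rows) and text tokens (columns).
    p_vis = [sum(row) for row in joint]
    p_txt = [sum(row[j] for row in joint) for j in range(len(text))]

    # Score of token i: sum_j p(i, j) * log(p(i, j) / (p(i) * p(j))).
    # These scores sum to the MI of the assumed joint distribution.
    scores = []
    for i, row in enumerate(joint):
        score = sum(
            p * math.log(p / (p_vis[i] * p_txt[j])) for j, p in enumerate(row)
        )
        scores.append(score)
    return scores


def prune_visual_tokens(visual, text, keep, tau=0.1):
    """Return indices (in original order) of the `keep` highest-scoring tokens."""
    scores = mi_token_scores(visual, text, tau)
    ranked = sorted(range(len(visual)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

In this sketch the scores depend only on the two feature sets, so no attention maps are read and no model layers are modified, which is the property the abstract calls non-intrusive. A visual token that is equally similar to every text token gets a score near zero and is pruned first.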

Jiameng Li, Aleksei Tiulpin, Matthew B. Blaschko • 2026

Related benchmarks

Task | Dataset | Result | Rank
Object Hallucination Evaluation | POPE | -- | 2019
Video Understanding | VideoMME | Score (Overall): 67.1 | 357
Science Question Answering | ScienceQA (SQA) | Accuracy: 69.81 | 273
Long Video Understanding | LongVideoBench | -- | 269
Video Question Answering | VideoMME | -- | 251
Visual Question Answering | TextVQA | TextVQA Accuracy: 55.9 | 210
Multimodal Evaluation | MM-Vet | -- | 196
Video Question Answering | MSVD | Accuracy: 70.6 | 152
Visual Question Answering | GQA | GQA Score: 57.01 | 139
Video Question Answering | EgoSchema subset | Accuracy: 63 | 124
Showing 10 of 21 rows
