
ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs

About

Recent multimodal large language models are computationally expensive because Transformers must process a large number of visual tokens. We present ReDiPrune, a training-free token pruning method applied before the vision-language projector, where visual features remain rich and discriminative. Unlike post-projection pruning methods that operate on compressed representations, ReDiPrune selects informative tokens directly from vision encoder outputs, preserving fine-grained spatial and semantic cues. Each token is scored by a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, ensuring the selected tokens are both query-relevant and non-redundant. ReDiPrune is fully plug-and-play, requiring no retraining or architectural modifications, and can be seamlessly inserted between the encoder and projector. Across four video and five image benchmarks, it consistently improves the accuracy-efficiency trade-off. For example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of visual tokens yields a +2.0% absolute accuracy gain while reducing computation by more than $6\times$ in TFLOPs. Code is available at https://github.com/UA-CVML/ReDiPrune.
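The abstract describes scoring each visual token by combining text-conditioned relevance with max-min diversity. A minimal NumPy sketch of that style of greedy selection is shown below; the function name, the cosine-similarity relevance score, and the `alpha` weighting between the two terms are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def relevance_diversity_select(visual_tokens, text_embedding,
                               keep_ratio=0.15, alpha=0.5):
    """Greedy relevance + max-min diversity token selection (sketch).

    visual_tokens: (N, D) pre-projection visual features.
    text_embedding: (D,) pooled text/query embedding.
    Returns sorted indices of the kept tokens.
    """
    # L2-normalize so dot products are cosine similarities.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)

    relevance = v @ t                      # text-conditioned relevance per token
    n_keep = max(1, int(round(keep_ratio * len(v))))

    selected = [int(np.argmax(relevance))]  # seed with the most relevant token
    # Min cosine distance from each token to the selected set (max-min diversity).
    min_dist = 1.0 - v @ v[selected[0]]

    for _ in range(n_keep - 1):
        score = alpha * relevance + (1.0 - alpha) * min_dist
        score[selected] = -np.inf          # never reselect a kept token
        nxt = int(np.argmax(score))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - v @ v[nxt])

    return sorted(selected)
```

With `keep_ratio=0.15` this keeps 15% of the tokens, matching the EgoSchema setting quoted in the abstract; the selected subset can then be passed to the projector unchanged, which is what makes the method plug-and-play.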

An Yu, Ting Yu Tsai, Zhenfei Zhang, Weiheng Lu, Felix X.-F. Ye, Ming-Ching Chang• 2026

Related benchmarks

Task | Dataset | Result | Rank
Object Hallucination Evaluation | POPE | - | 1455
Multimodal Understanding | MMBench | Accuracy 59.88 | 637
Video Question Answering | ActivityNet-QA | Accuracy 45.69 | 376
Video Question Answering | EgoSchema | Accuracy 45.6 | 161
Multimodal Understanding | MME | Score 1390 | 83
Video Question Answering | NextQA | WUPS 26.42 | 26
Science Question Answering | ScienceQA IMG | EM 69.11 | 9
Video Understanding | Video-ChatGPT | Score 2.663 | 8
