DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

About

Large Multimodal Models (LMMs) have emerged as powerful models capable of understanding various data modalities, including text, images, and videos. LMMs encode both text and visual data into tokens that are then combined and processed by an integrated Large Language Model (LLM). Including visual tokens substantially increases the total token count, often by thousands. The increased input length for the LLM significantly raises the complexity of inference, resulting in high latency in LMMs. To address this issue, token pruning methods, which remove part of the visual tokens, have been proposed. The existing token pruning methods either require extensive calibration and fine-tuning or rely on suboptimal importance metrics, which results in increased redundancy among the retained tokens. In this paper, we first formulate token pruning as a Max-Min Diversity Problem (MMDP), where the goal is to select a subset such that the diversity among the selected tokens is maximized. Then, we solve the MMDP to obtain the selected subset and prune the rest. The proposed method, DivPrune, reduces redundancy and achieves the highest diversity of the selected tokens. By ensuring high diversity, the selected tokens better represent the original tokens, enabling effective performance even at high pruning ratios without requiring fine-tuning. Extensive experiments with various LMMs show that DivPrune achieves state-of-the-art accuracy over 16 image- and video-language datasets. Additionally, DivPrune reduces both the end-to-end latency and GPU memory usage for the tested models. The code is available at https://github.com/vbdi/divprune.

Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang • 2025
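
The abstract describes selecting a subset of visual tokens so that the minimum pairwise distance among the kept tokens is as large as possible. The sketch below illustrates that idea with a common greedy heuristic (farthest-point selection). It is not taken from the authors' repository: the function name `select_diverse_tokens`, the use of cosine distance, and seeding with token 0 are all assumptions made for illustration, and the paper's actual solver may differ.

```python
import numpy as np


def select_diverse_tokens(tokens: np.ndarray, k: int) -> np.ndarray:
    """Greedy max-min selection (illustrative sketch, not the authors' code).

    tokens: array of shape (n, d), one row per visual token embedding.
    k:      number of tokens to keep.
    Returns the sorted indices of the k kept tokens.
    """
    n = tokens.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of tokens")

    # Pairwise cosine distance (assumed metric): 1 - cosine similarity.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)
    dist = 1.0 - unit @ unit.T

    selected = [0]             # arbitrary seed token (assumption)
    min_dist = dist[0].copy()  # distance from each token to its nearest kept token
    while len(selected) < k:
        min_dist[selected] = -np.inf        # never pick a token twice
        nxt = int(np.argmax(min_dist))      # token farthest from the kept set
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])

    # Sorting preserves the original token order when indexing.
    return np.sort(np.array(selected))
```

Under these assumptions, the pruned sequence would be `tokens[select_diverse_tokens(tokens, k)]`. Near-duplicate tokens sit close to an already kept token, so they are the last to be picked, which is the redundancy-reduction effect the abstract describes.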

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 88.79	2019
Visual Question Answering	VizWiz	Accuracy 62.29	1820
Visual Question Answering	TextVQA	Accuracy 83.1	1453
Visual Question Answering	VQA v2	Accuracy 80.4	1429
Visual Question Answering	GQA	Accuracy 61.9	1425
Text-based Visual Question Answering	TextVQA	Accuracy 73.1	962
Multimodal Understanding	MMBench	Accuracy 85.1	847
Science Question Answering	ScienceQA	Accuracy 93.8	791
Multimodal Evaluation	MME	Score 2.19e+3	727
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy 77.2	712

Showing 10 of 313 rows
