Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More

About

Vision tokens in multimodal large language models often dominate huge computational overhead due to their excessive length compared to linguistic modality. Abundant recent methods aim to solve this problem with token pruning, which first defines an importance criterion for tokens and then prunes the unimportant vision tokens during inference. However, in this paper, we show that the importance is not an ideal indicator to decide whether a token should be pruned. Surprisingly, it usually results in inferior performance than random token pruning and leading to incompatibility to efficient attention computation operators.Instead, we propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on its duplication with other tokens, leading to significant and training-free acceleration. Concretely, DART selects a small subset of pivot tokens and then retains the tokens with low duplication to the pivots, ensuring minimal information loss during token pruning. Experiments demonstrate that DART can prune 88.9% vision tokens while maintaining comparable performance, leading to a 1.99$\times$ and 2.99$\times$ speed-up in total time and prefilling stage, respectively, with good compatibility to efficient attention operators. Our codes are available at https://github.com/ZichenWen1/DART.

Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang• 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy89.2	2019
Visual Question Answering	VizWiz	Accuracy98.57	1820
Visual Question Answering	TextVQA	Accuracy82.1	1453
Visual Question Answering	VQA v2	Accuracy79.47	1429
Visual Question Answering	GQA	Accuracy61.7	1425
Text-based Visual Question Answering	TextVQA	Accuracy72.2	962
Multimodal Understanding	MMBench	Accuracy79.6	847
Science Question Answering	ScienceQA	Accuracy85.9	791
Multimodal Evaluation	MME	Score2.25e+3	727
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy75.7	712

Showing 10 of 197 rows

...

Other info

Follow for update

@wizwand_team Discord