
ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference

About

Recent advances have explored visual token pruning to accelerate the inference of large vision-language models (LVLMs). However, existing methods often struggle to balance token importance and diversity: importance-based methods tend to retain redundant tokens, whereas diversity-based methods may overlook informative ones. This trade-off becomes especially problematic under high reduction ratios, where preserving only a small subset of visual tokens is critical. To address this issue, we propose ID-Selection, a simple yet effective token selection strategy for efficient LVLM inference. The key idea is to couple importance estimation with diversity-aware iterative selection: each token is first assigned an importance score, after which high-scoring tokens are selected one by one while the scores of similar tokens are progressively suppressed. In this way, ID-Selection preserves informative tokens while reducing redundancy in a unified selection process. Extensive experiments across 5 LVLM backbones and 16 main benchmarks demonstrate that ID-Selection consistently achieves superior performance and efficiency, especially under extreme pruning ratios. For example, on LLaVA-1.5-7B, ID-Selection prunes 97.2% of visual tokens, retaining only 16 tokens, while reducing inference FLOPs by over 97% and preserving 91.8% of the original performance, all without additional training.
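The selection loop described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so several pieces here are assumptions rather than the authors' method: the function name `id_select_sketch`, the use of cosine similarity between token features, and the multiplicative suppression rule `score *= (1 - similarity)`. The importance scores are taken as given inputs. The final lines check the quoted pruning ratio, assuming the 576 visual tokens that LLaVA-1.5 produces per image (a figure not stated on this page).

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def id_select_sketch(features, importance, k):
    """Pick up to k token indices: repeatedly take the highest-scoring
    remaining token, then suppress the scores of tokens similar to it.

    The suppression rule below is an assumption made for illustration;
    the source only says that similar tokens' scores are suppressed.
    """
    if len(features) != len(importance):
        raise ValueError("features and importance must have the same length")
    scores = list(importance)          # working copy, assumed non-negative
    remaining = list(range(len(scores)))
    selected = []
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda i: scores[i])
        selected.append(best)
        remaining.remove(best)
        for j in remaining:
            # Negative similarities are clamped so scores are never boosted.
            sim = max(0.0, cosine(features[best], features[j]))
            scores[j] *= 1.0 - sim
    return selected


# Toy input: tokens 0 and 1 are near-duplicates, as are tokens 2 and 3.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
importance = [0.9, 0.8, 0.5, 0.4]

top_k = sorted(range(4), key=lambda i: -importance[i])[:2]
picked = id_select_sketch(features, importance, 2)
print(top_k)   # [0, 1]  importance alone keeps the redundant pair
print(picked)  # [0, 2]  suppression pushes the second pick to a new region

# Pruning ratio for 16 retained tokens out of an assumed 576.
pruned_percent = round(100 * (1 - 16 / 576), 1)
print(pruned_percent)  # 97.2
```

In the toy input, plain top-k by importance keeps two near-identical tokens, while the suppression step lowers the score of token 1 after token 0 is chosen, so the second slot goes to a token from the other cluster.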

Zhaohong Huang, Wenjing Liu, Yuxin Zhang, Fei Chao, Rongrong Ji • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VizWiz | Accuracy | 56.2 | 1525 |
| Object Hallucination Evaluation | POPE | Accuracy | 90.5 | 1455 |
| Visual Question Answering | VQA v2 | Accuracy | 78.7 | 1362 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 56.5 | 807 |
| Multimodal Evaluation | MME | Score | 2.22e+3 | 658 |
| Multimodal Understanding | MMBench | Accuracy | 65.4 | 637 |
| Scientific Question Answering | ScienceQA image | Accuracy | 94.4 | 184 |
| Multimodal Understanding | MMBench CN | Accuracy | 57.6 | 174 |
| Multimodal Evaluation | MMBench CN | Accuracy | 55.8 | 83 |
| Multimodal Benchmarking | MMBench | Accuracy | 82.1 | 58 |
Showing 10 of 18 rows
