ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference

About

Recent advances have explored visual token pruning to accelerate the inference of large vision-language models (LVLMs). However, existing methods often struggle to balance token importance and diversity: importance-based methods tend to retain redundant tokens, whereas diversity-based methods may overlook informative ones. This trade-off becomes especially problematic under high reduction ratios, where preserving only a small subset of visual tokens is critical. To address this issue, we propose ID-Selection, a simple yet effective token selection strategy for efficient LVLM inference. The key idea is to couple importance estimation with diversity-aware iterative selection: each token is first assigned an importance score, after which high-scoring tokens are selected one by one while the scores of similar tokens are progressively suppressed. In this way, ID-Selection preserves informative tokens while reducing redundancy in a unified selection process. Extensive experiments across 5 LVLM backbones and 16 main benchmarks demonstrate that ID-Selection consistently achieves superior performance and efficiency, especially under extreme pruning ratios. For example, on LLaVA-1.5-7B, ID-Selection prunes 97.2% of visual tokens, retaining only 16 tokens, while reducing inference FLOPs by over 97% and preserving 91.8% of the original performance, all without additional training.
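The selection loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not say how importance scores are computed, which similarity measure is used, or how the suppression is applied, so the sketch assumes precomputed importance scores, cosine similarity, and a multiplicative penalty. The names `id_select` and `suppress` are hypothetical. It uses plain Python to stay self-contained; a practical version would use batched tensor operations.

```python
import math


def id_select(features, importance, k, suppress=1.0):
    """Greedy importance-diversity token selection (illustrative sketch).

    features:   list of N embedding vectors, one per visual token
    importance: list of N non-negative importance scores
    k:          number of tokens to keep
    suppress:   assumed penalty strength in [0, 1]; 0 reduces to plain top-k
    Returns the indices of the kept tokens in their original order.
    """
    n = len(importance)
    if len(features) != n:
        raise ValueError("features and importance must have the same length")
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= number of tokens")

    # Unit-normalise so that a dot product equals cosine similarity.
    unit = []
    for vec in features:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        unit.append([x / norm for x in vec])

    scores = [float(s) for s in importance]
    available = [True] * n
    selected = []

    for _ in range(k):
        # Pick the remaining token with the highest current score.
        best = max((i for i in range(n) if available[i]),
                   key=lambda i: scores[i])
        selected.append(best)
        available[best] = False

        # Suppress the scores of tokens similar to the one just picked.
        for i in range(n):
            if available[i]:
                sim = sum(a * b for a, b in zip(unit[i], unit[best]))
                sim = min(max(sim, 0.0), 1.0)
                scores[i] *= 1.0 - suppress * sim

    return sorted(selected)


# Tokens 0 and 1 point in the same direction; token 2 is orthogonal.
feats = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
imp = [1.0, 0.9, 0.5]
kept = id_select(feats, imp, k=2)
```

In this toy input, plain top-2 by importance would keep the two redundant tokens 0 and 1, whereas the suppression step drives token 1's score to zero after token 0 is picked, so the less important but distinct token 2 is kept instead. In the LLaVA-1.5-7B setting quoted above, the same loop would run with `k=16`.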

Zhaohong Huang, Wenjing Liu, Yuxin Zhang, Fei Chao, Rongrong Ji • 2026

Related benchmarks

Task	Dataset	Metric	Value	
Object Hallucination Evaluation	POPE	Accuracy	90.5	2019
Visual Question Answering	VizWiz	Accuracy	56.2	1820
Visual Question Answering	VQA v2	Accuracy	78.7	1429
Text-based Visual Question Answering	TextVQA	Accuracy	56.5	962
Multimodal Understanding	MMBench	Accuracy	65.4	847
Multimodal Evaluation	MME	Score	2.22e+3	727
Scientific Question Answering	ScienceQA image	Accuracy	94.4	259
Multimodal Understanding	MMBench CN	Accuracy	57.6	254
Multimodal Evaluation	MMBench CN	Accuracy	55.8	120
Multimodal Benchmarking	MMBench	Accuracy	82.1	90

