ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference
About
Recent advances have explored visual token pruning to accelerate the inference of large vision-language models (LVLMs). However, existing methods often struggle to balance token importance and diversity: importance-based methods tend to retain redundant tokens, whereas diversity-based methods may overlook informative ones. This trade-off becomes especially problematic under high reduction ratios, where preserving only a small subset of visual tokens is critical. To address this issue, we propose ID-Selection, a simple yet effective token selection strategy for efficient LVLM inference. The key idea is to couple importance estimation with diversity-aware iterative selection: each token is first assigned an importance score, after which high-scoring tokens are selected one by one while the scores of similar tokens are progressively suppressed. In this way, ID-Selection preserves informative tokens while reducing redundancy in a unified selection process. Extensive experiments across 5 LVLM backbones and 16 main benchmarks demonstrate that ID-Selection consistently achieves superior performance and efficiency, especially under extreme pruning ratios. For example, on LLaVA-1.5-7B, ID-Selection prunes 97.2% of visual tokens, retaining only 16 tokens, while reducing inference FLOPs by over 97% and preserving 91.8% of the original performance, all without additional training.
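The selection loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function name `id_select`, the cosine-similarity measure, the multiplicative suppression rule `(1 - sim) ** alpha`, and the `alpha` parameter are all assumptions made for the example. The abstract only states that tokens are picked one by one by score while the scores of similar tokens are progressively suppressed. Importance scores are assumed to be non-negative.

```python
import numpy as np


def id_select(features, importance, k, alpha=1.0):
    """Greedy importance-plus-diversity token selection (illustrative sketch).

    features:   (n, d) array of visual token embeddings.
    importance: (n,) array of non-negative importance scores.
    k:          number of tokens to keep.
    alpha:      hypothetical suppression strength (assumption, not from the paper).
    Returns the indices of the selected tokens in selection order.
    """
    features = np.asarray(features, dtype=np.float64)
    scores = np.asarray(importance, dtype=np.float64).copy()
    n = features.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of tokens")

    # Normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)

    available = np.ones(n, dtype=bool)
    selected = []
    for _ in range(k):
        # Pick the highest-scoring token that has not been chosen yet.
        idx = int(np.argmax(np.where(available, scores, -np.inf)))
        selected.append(idx)
        available[idx] = False

        # Assumed suppression rule: scale every score down by its
        # similarity to the token just selected.
        sim = np.clip(unit @ unit[idx], 0.0, 1.0)
        scores *= (1.0 - sim) ** alpha
    return selected
```

With two identical tokens and one distinct token, plain top-k by importance would keep both duplicates, whereas this sketch keeps one duplicate and the distinct token: `id_select([[1, 0], [1, 0], [0, 1]], [1.0, 0.9, 0.5], k=2)` selects tokens 0 and 2.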
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VizWiz | Accuracy | 56.2 | 1525 |
| Object Hallucination Evaluation | POPE | Accuracy | 90.5 | 1455 |
| Visual Question Answering | VQA v2 | Accuracy | 78.7 | 1362 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 56.5 | 807 |
| Multimodal Evaluation | MME | Score | 2.22e+3 | 658 |
| Multimodal Understanding | MMBench | Accuracy | 65.4 | 637 |
| Scientific Question Answering | ScienceQA image | Accuracy | 94.4 | 184 |
| Multimodal Understanding | MMBench CN | Accuracy | 57.6 | 174 |
| Multimodal Evaluation | MMBench CN | Accuracy | 55.8 | 83 |
| Multimodal Benchmarking | MMBench | Accuracy | 82.1 | 58 |