Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity
About
Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8$\times$ faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility. Code is available at https://github.com/ZhengyaoFang/PruneSID.
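The two-stage idea in the abstract (group tokens by semantic component, then suppress near-duplicates inside each group) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `group_tokens`, `intra_group_nms`, and `prune_tokens`, the use of plain PCA loadings as a stand-in for PSCA, the use of projection magnitude as the importance score, and the `sim_threshold` and `budget` parameters are all assumptions made for illustration. The dynamic compression ratio mechanism is not covered here.

```python
import numpy as np


def group_tokens(tokens, num_groups):
    """Assign each token to the principal direction it projects onto most strongly.

    Assumption: ordinary PCA stands in for the paper's PSCA step.
    Requires num_groups <= min(num_tokens, feature_dim).
    """
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions in feature space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = np.abs(centered @ vt[:num_groups].T)  # shape (N, num_groups)
    # Group id per token, plus a hypothetical importance score per token.
    return scores.argmax(axis=1), scores.max(axis=1)


def intra_group_nms(tokens, importance, sim_threshold):
    """Greedy NMS: visit tokens by descending importance and drop any token
    whose cosine similarity to an already kept token reaches the threshold."""
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    kept = []
    for idx in np.argsort(-importance):
        if all(normed[idx] @ normed[k] < sim_threshold for k in kept):
            kept.append(idx)
    return kept


def prune_tokens(tokens, num_groups=8, budget=64, sim_threshold=0.8):
    """Return sorted indices of the tokens retained after grouping and NMS."""
    groups, importance = group_tokens(tokens, num_groups)
    candidates = []
    for g in range(num_groups):
        members = np.flatnonzero(groups == g)
        if members.size == 0:
            continue
        local = intra_group_nms(tokens[members], importance[members], sim_threshold)
        candidates.extend(members[local])
    candidates = np.array(candidates)
    # If NMS leaves more tokens than the budget, keep the most important ones.
    if candidates.size > budget:
        top = np.argsort(-importance[candidates])[:budget]
        candidates = candidates[top]
    return np.sort(candidates)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual_tokens = rng.normal(size=(576, 32))  # random stand-in features
    kept = prune_tokens(visual_tokens, budget=64)
    print(len(kept), "tokens kept out of", len(visual_tokens))
```

As a point of reference for the retention figures, and assuming the 576-token visual grid of LLaVA-1.5, a budget of 64 tokens corresponds to 64/576, or about 11.1% retention.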
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 87.1 | 1455 |
| Visual Question Answering | GQA | Accuracy | 60.2 | 1249 |
| Multimodal Understanding | MMBench | Accuracy | 63.8 | 637 |
| Visual Question Answering | ScienceQA | Accuracy | 71.1 | 370 |
| Video Question Answering | MSVD | Accuracy | 67.1 | 152 |
| Massive Multi-discipline Multimodal Understanding | MMMU | Accuracy | 37.2 | 152 |
| Video Question Answering | MSRVTT | Accuracy | 53.3 | 100 |
| Multimodal Understanding | LLaVA Evaluation Suite 1.5 | Average Score | 98.6 | 95 |
| Visual Question Answering | GQA | GQA Score | 61.2 | 85 |
| Multimodal Understanding | MME | Score | 1.80e+3 | 83 |