Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity

About

Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8$\times$ faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility. Code is available at https://github.com/ZhengyaoFang/PruneSID.

Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Guangming Lu, Jun Yu, Wenjie Pei • 2026
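
The two-stage idea described in the abstract (group tokens by semantic component, then suppress near-duplicates inside each group) can be illustrated with a rough sketch. This is not the authors' implementation; the official code is at the repository linked above. Everything here is an assumption made for illustration: plain SVD stands in for PSCA, the absolute loading on the assigned component stands in for an importance score, cosine similarity drives the suppression, and the names `group_by_principal_components`, `intra_group_nms`, `prune_tokens`, `sim_threshold`, and `max_per_group` are hypothetical. The dynamic compression ratio mechanism is not modelled.

```python
import numpy as np


def group_by_principal_components(tokens, num_groups):
    """Assign each token to the principal direction it loads on most strongly.

    Assumption: ordinary SVD stands in for the paper's PSCA step.
    Requires num_groups <= min(num_tokens, feature_dim).
    """
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions in feature space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = np.abs(centered @ vt[:num_groups].T)  # shape (N, num_groups)
    # Group id per token, plus the loading used here as a stand-in score.
    return loadings.argmax(axis=1), loadings.max(axis=1)


def intra_group_nms(tokens, scores, sim_threshold):
    """Greedy NMS: keep the best-scoring token, drop tokens too similar to it."""
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    suppressed = np.zeros(len(tokens), dtype=bool)
    keep = []
    for i in np.argsort(-scores):  # highest score first
        if suppressed[i]:
            continue
        keep.append(int(i))
        # Cosine similarity of every token to the token just kept.
        suppressed |= (normed @ normed[i]) > sim_threshold
    return keep  # local indices, ordered by descending score


def prune_tokens(tokens, num_groups=8, sim_threshold=0.5, max_per_group=8):
    """Return sorted indices of the tokens retained after grouping and NMS."""
    groups, scores = group_by_principal_components(tokens, num_groups)
    kept = []
    for g in range(num_groups):
        idx = np.flatnonzero(groups == g)
        if idx.size == 0:
            continue
        local = intra_group_nms(tokens[idx], scores[idx], sim_threshold)
        kept.extend(idx[local[:max_per_group]].tolist())
    return np.array(sorted(kept), dtype=int)


# Toy input: 576 random "visual tokens" with 32 features each.
rng = np.random.default_rng(0)
demo_tokens = rng.normal(size=(576, 32))
demo_kept = prune_tokens(demo_tokens)
print(f"kept {len(demo_kept)} of {len(demo_tokens)} tokens")
```

With `num_groups=8` and `max_per_group=8`, at most 64 of the 576 toy tokens survive, which is about 11.1% retention. Capping each group separately is one simple way to keep every group represented in the output; how PruneSID actually allocates its token budget across groups is not stated in the abstract.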

Related benchmarks

Task | Dataset | Result | Rank
Object Hallucination Evaluation | POPE | Accuracy 87.1 | 2019
Visual Question Answering | TextVQA | Accuracy 73.46 | 1453
Visual Question Answering | VQA v2 | Accuracy 74.8 | 1429
Visual Question Answering | GQA | Accuracy 60.2 | 1425
Multimodal Understanding | MMBench | Accuracy 63.8 | 847
Visual Question Answering | ChartQA | Accuracy 76.92 | 519
Visual Question Answering | ScienceQA | Accuracy 71.1 | 446
Optical Character Recognition | OCRBench | Score 726 | 433
Massive Multi-discipline Multimodal Understanding | MMMU | Accuracy 37.2 | 216
Visual Question Answering | TextVQA | TextVQA Accuracy 65.89 | 210
Showing 10 of 53 rows