
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

About

Large vision-language models (LVLMs) generally contain significantly more visual tokens than textual tokens, resulting in a considerable computational burden. Recent efforts tackle this issue by pruning visual tokens early within the language model. Most existing works use attention scores between text and visual tokens to assess the importance of visual tokens. In this study, however, we first analyze the text-visual attention in the language model and find that this score is not an ideal indicator for token pruning. Based on this analysis, we propose VisPruner, a plug-and-play method that exploits visual cues for more effective token pruning in LVLMs. Specifically, we first use visual attention to select a limited number of significant tokens. Then, we remove duplicate tokens from the remaining ones based on their similarity. By retaining diverse tokens alongside the initially selected important tokens, we maximally preserve the visual information of the input image. Experimental results demonstrate that VisPruner sustains strong performance across various VLM architectures and reduction ratios, significantly outperforming existing methods based on text-visual attention. Notably, without any training, VisPruner reduces the FLOPs of LLaVA-1.5-7B by 91% and inference latency by 75% while maintaining comparable performance. Our code is available at https://github.com/Theia-4869/VisPruner.
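The two-stage selection described above (attention-based picking, then similarity-based deduplication of the remainder) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `visprune`, the greedy deduplication loop, and the similarity threshold of 0.9 are all assumptions for exposition.

```python
import numpy as np

def visprune(tokens, attn, num_keep, num_important, sim_threshold=0.9):
    """Sketch of a VisPruner-style two-stage visual token selection.

    tokens: (N, D) array of visual token features.
    attn:   (N,) visual attention scores (e.g., from the vision encoder).
    Returns indices of at most `num_keep` retained tokens.
    """
    # Stage 1: keep the most attended (important) tokens.
    order = np.argsort(-attn)
    kept = list(order[:num_important])
    remaining = list(order[num_important:])

    # Stage 2: from the remainder, greedily add tokens that are not
    # near-duplicates of anything already kept (cosine similarity).
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    for idx in remaining:
        if len(kept) >= num_keep:
            break
        if (normed[kept] @ normed[idx]).max() < sim_threshold:
            kept.append(idx)
    return sorted(kept)
```

The diversity stage is what distinguishes this from pure attention ranking: duplicated background patches are dropped even if moderately attended, so the retained budget covers more distinct image content.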

Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | VQA v2 | Accuracy 80.8 | 1165 |
| Visual Question Answering | TextVQA | Accuracy 57.4 | 1117 |
| Object Hallucination Evaluation | POPE | Accuracy 85.9 | 935 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy 71.9 | 664 |
| Multimodal Evaluation | MME | Score 1660 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy 62.5 | 496 |
| Video Question Answering | MSRVTT-QA | Accuracy 56.7 | 481 |
| Visual Question Answering | GQA | Accuracy 52.2 | 374 |
| Multimodal Understanding | MMBench | -- | 367 |
| Video Question Answering | MSVD-QA | Accuracy 70.2 | 340 |

Showing 10 of 68 rows.
