
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

About

In vision-language models (VLMs), visual tokens usually incur significant computational overhead despite carrying sparser information than text tokens. Most existing methods address this by training a network to prune redundant visual tokens on certain training data. In contrast, we propose SparseVLM, a text-guided, training-free token optimization mechanism that eliminates the need for extra parameters or fine-tuning. Since visual tokens complement text tokens in a VLM's linguistic reasoning, we select relevant text tokens to rate the significance of visual tokens via the self-attention matrices, and then prune visual tokens with the proposed strategy to maximize sparsity while retaining information. In particular, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, together with a token recycling method that compresses pruned tokens into more compact representations. Experiments show that SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. For example, LLaVA equipped with SparseVLM achieves a 54% reduction in FLOPs and a 37% decrease in CUDA latency while maintaining 97% of its original accuracy. Our code is available at https://github.com/Gumpest/SparseVLMs.
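The core idea can be sketched in a few lines of numpy: score each visual token by the attention it receives from relevant text tokens, keep the top-scoring fraction, and compress the pruned remainder into a handful of aggregated "recycled" slots rather than discarding it. This is a minimal illustrative sketch, not the authors' implementation; the function names, the score-weighted grouping used for recycling, and the matrix-rank proxy for the per-layer ratio are all assumptions made here for clarity.

```python
import numpy as np

def adaptive_keep_ratio(attn, floor=0.1):
    """Hypothetical rank-based ratio: use the relative rank of the
    text-to-visual attention matrix as a proxy for how much visual
    information a layer actually uses."""
    r = np.linalg.matrix_rank(attn)
    return max(floor, r / attn.shape[1])

def sparsify_visual_tokens(attn, visual_tokens, keep_ratio=0.5, n_recycled=4):
    """Illustrative text-guided pruning with token recycling.

    attn:          (n_text, n_visual) attention weights from the selected
                   text tokens to the visual tokens.
    visual_tokens: (n_visual, d) visual hidden states.
    """
    # Rate each visual token by the attention it receives from text tokens.
    scores = attn.mean(axis=0)                       # (n_visual,)
    n_keep = max(1, int(keep_ratio * len(scores)))
    order = np.argsort(scores)[::-1]
    kept_idx, pruned_idx = order[:n_keep], order[n_keep:]

    kept = visual_tokens[np.sort(kept_idx)]          # preserve original order

    # "Token recycling": instead of discarding pruned tokens, merge them
    # into a few compact slots via score-weighted averaging.
    if len(pruned_idx) > 0:
        groups = np.array_split(pruned_idx, min(n_recycled, len(pruned_idx)))
        recycled = np.stack([
            np.average(visual_tokens[g], axis=0, weights=scores[g] + 1e-8)
            for g in groups
        ])
        kept = np.concatenate([kept, recycled], axis=0)
    return kept
```

With 16 visual tokens, `keep_ratio=0.5`, and 4 recycled slots, the sequence shrinks from 16 tokens to 12 (8 kept + 4 recycled), which is where the FLOP savings come from: every subsequent layer attends over the shorter sequence.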

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 77.1 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 57.7 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 51.4 | 1043 |
| Visual Question Answering | GQA | Accuracy | 58.9 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 84.9 | 935 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 75.6 | 664 |
| Multimodal Evaluation | MME | Score | 2060 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 58.4 | 496 |
| Robot Manipulation | LIBERO | Goal Achievement | 97.6 | 494 |
| Video Question Answering | MSRVTT-QA | Accuracy | 31 | 481 |

Showing 10 of 143 rows.
