
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

About

In vision-language models (VLMs), visual tokens usually account for a significant share of the computational overhead even though they carry sparser information than text tokens. To address this, most existing methods learn a network that prunes redundant visual tokens using certain training data. In contrast, we propose a text-guided, training-free token optimization mechanism dubbed SparseVLM that eliminates the need for extra parameters or fine-tuning costs. Given that visual tokens complement text tokens in a VLM's linguistic reasoning, we select relevant text tokens to rate the significance of visual tokens using self-attention matrices, and then prune visual tokens using the proposed strategy to maximize sparsity while retaining information. In particular, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM increases the efficiency of various VLMs in a number of image and video understanding tasks. For example, LLaVA equipped with SparseVLM achieves a 54% reduction in FLOPs and a 37% decrease in CUDA latency while maintaining 97% of its original accuracy. Our code is available at https://github.com/Gumpest/SparseVLMs.
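
The three steps named in the abstract (text-guided scoring, a rank-based pruning budget, and token recycling) can be illustrated with a small NumPy sketch for a single layer. This is not the authors' implementation; the function name `sparsify_visual_tokens`, the `scale` parameter, the mean-threshold rule for choosing rater text tokens, the budget formula `scale * (V - rank)`, and the use of a simple average for recycling are all assumptions made for illustration. The official code at the repository linked above should be consulted for the actual method.

```python
import numpy as np

def sparsify_visual_tokens(hidden, attn, visual_idx, text_idx, scale=1.0):
    """Illustrative single-layer sketch (hypothetical helper, not the paper's code).

    hidden: (L, d) token hidden states.
    attn:   (L, L) head-averaged self-attention, rows are queries.
    visual_idx / text_idx: positions of visual and text tokens.
    scale:  assumed knob that scales the pruning budget.
    """
    visual_idx = np.asarray(visual_idx)
    text_idx = np.asarray(text_idx)

    # Attention that each text token pays to each visual token: shape (T, V).
    t2v = attn[np.ix_(text_idx, visual_idx)]

    # Assumption: "relevant" text tokens are those whose mean attention to the
    # visual tokens is at least the average over all text tokens.
    relevance = t2v.mean(axis=1)
    raters = relevance >= relevance.mean()
    if not raters.any():  # guard against floating-point ties
        raters[:] = True
    rated = t2v[raters]  # (R, V)

    # Significance of each visual token: mean attention from the raters.
    scores = rated.mean(axis=0)

    # Assumption: the pruning budget grows with the redundancy of the rating
    # matrix, measured as (number of visual tokens - matrix rank).
    rank = np.linalg.matrix_rank(rated)
    n_prune = int(scale * (len(visual_idx) - rank))
    n_prune = max(0, min(n_prune, len(visual_idx) - 1))  # keep at least one

    order = np.argsort(scores)  # ascending: least significant first
    pruned = visual_idx[order[:n_prune]]
    kept = np.sort(visual_idx[order[n_prune:]])

    # Simplified recycling: collapse all pruned tokens into one averaged token.
    if n_prune > 0:
        recycled = hidden[pruned].mean(axis=0, keepdims=True)
    else:
        recycled = np.empty((0, hidden.shape[1]))
    return kept, pruned, recycled
```

Calling the function with the hidden states and a head-averaged attention map of one layer returns the positions of the kept visual tokens, the positions of the pruned ones, and a compact recycled representation that could be appended to the kept tokens before the next layer. Averaging into a single token is the crudest possible form of recycling and is used here only to keep the sketch short.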

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy | 86.45 | 2019 |
| Visual Question Answering | VizWiz | Accuracy | 56.3 | 1820 |
| Visual Question Answering | TextVQA | Accuracy | 60.6 | 1453 |
| Visual Question Answering | VQA v2 | Accuracy | 78.31 | 1429 |
| Visual Question Answering | GQA | Accuracy | 61.2 | 1425 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 59.7 | 962 |
| Robot Manipulation | LIBERO | Object Achievement | 94.2 | 957 |
| Multimodal Understanding | MMBench | Accuracy | 65.7 | 847 |
| Science Question Answering | ScienceQA | Accuracy | 73.1 | 791 |
| Multimodal Evaluation | MME | Score | 2.06e+3 | 727 |
Showing 10 of 303 rows