SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

About

In vision-language models (VLMs), visual tokens usually account for a significant share of the computational overhead even though they carry sparser information than text tokens. To address this, most existing methods learn a network to prune redundant visual tokens using certain training data. In contrast, we propose a text-guided, training-free token optimization mechanism dubbed SparseVLM that eliminates the need for extra parameters or fine-tuning costs. Given that visual tokens complement text tokens in a VLM's linguistic reasoning, we select relevant text tokens to rate the significance of visual tokens using self-attention matrices, and then prune visual tokens using the proposed strategy to maximize sparsity while retaining information. In particular, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM increases the efficiency of various VLMs in a number of image and video understanding tasks. For example, LLaVA equipped with SparseVLM achieves a 54% reduction in FLOPs and a 37% decrease in CUDA latency while maintaining 97% of its original accuracy. Our code is available at https://github.com/Gumpest/SparseVLMs.

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang • 2024
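
As a rough illustration of the text-guided scoring described in the abstract, the following is a minimal Python sketch, not the authors' implementation (the linked repository is the reference for that). It assumes that a single self-attention matrix `attn` is available as a nested list, that the relevant text-token indices have already been chosen, and that a fixed `keep` budget stands in for the rank-based adaptive ratio. The function names `score_visual_tokens`, `prune_visual_tokens`, and `merge_tokens` are hypothetical, and averaging the pruned hidden states is only a placeholder for the token recycling step.

```python
from typing import List, Sequence, Tuple


def score_visual_tokens(
    attn: Sequence[Sequence[float]],
    text_idx: Sequence[int],
    visual_idx: Sequence[int],
) -> List[float]:
    """Mean attention that the chosen text tokens (rows) pay to each visual token (columns)."""
    if not text_idx:
        raise ValueError("at least one text token is needed to rate visual tokens")
    return [
        sum(attn[t][v] for t in text_idx) / len(text_idx)
        for v in visual_idx
    ]


def prune_visual_tokens(
    attn: Sequence[Sequence[float]],
    text_idx: Sequence[int],
    visual_idx: Sequence[int],
    keep: int,
) -> Tuple[List[int], List[int]]:
    """Split visual tokens into (kept, pruned) by score.

    Assumption: `keep` is a fixed budget supplied by the caller; the paper
    instead determines the ratio adaptively per layer.
    """
    scores = score_visual_tokens(attn, text_idx, visual_idx)
    # Positions within visual_idx, highest score first.
    ranked = sorted(range(len(visual_idx)), key=lambda i: scores[i], reverse=True)
    kept = sorted(visual_idx[i] for i in ranked[:keep])
    pruned = sorted(visual_idx[i] for i in ranked[keep:])
    return kept, pruned


def merge_tokens(
    hidden: Sequence[Sequence[float]],
    pruned: Sequence[int],
) -> List[float]:
    """Compress pruned tokens into one vector by averaging their hidden states.

    Assumption: plain averaging is a stand-in for the recycling method.
    """
    if not pruned:
        raise ValueError("no pruned tokens to merge")
    dim = len(hidden[pruned[0]])
    return [sum(hidden[p][d] for p in pruned) / len(pruned) for d in range(dim)]


if __name__ == "__main__":
    # Toy sequence: tokens 0-2 are visual, tokens 3-4 are text.
    attn = [
        [0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0],
        [0.1, 0.5, 0.2, 0.1, 0.1],
        [0.3, 0.4, 0.1, 0.1, 0.1],
    ]
    kept, pruned = prune_visual_tokens(attn, [3, 4], [0, 1, 2], keep=2)
    print(kept, pruned)  # [0, 1] [2]
```

In this toy example, visual token 2 receives the least attention from the two text tokens and is the one pruned. In a real model the attention matrix would come from a decoder layer, and the kept indices would be used to slice the hidden states before the next layer.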

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VizWiz | Accuracy 56.3 | 1525 |
| Object Hallucination Evaluation | POPE | Accuracy 86.45 | 1455 |
| Visual Question Answering | VQA v2 | Accuracy 78.31 | 1362 |
| Visual Question Answering | TextVQA | Accuracy 60.6 | 1285 |
| Visual Question Answering | GQA | Accuracy 61.2 | 1249 |
| Text-based Visual Question Answering | TextVQA | Accuracy 59.7 | 807 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy 75.6 | 706 |
| Robot Manipulation | LIBERO | Goal Achievement 97.6 | 700 |
| Multimodal Evaluation | MME | Score 2.06e+3 | 658 |
| Multimodal Understanding | MMBench | Accuracy 65.7 | 637 |
Showing 10 of 251 rows