SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

About

In vision-language models (VLMs), visual tokens usually account for a significant share of the computational overhead even though they carry sparser information than text tokens. To address this, most existing methods learn a network to prune redundant visual tokens using certain training data. In contrast, we propose a text-guided, training-free token optimization mechanism dubbed SparseVLM that eliminates the need for extra parameters or fine-tuning costs. Given that visual tokens complement text tokens in a VLM's linguistic reasoning, we select relevant text tokens to rate the significance of visual tokens using self-attention matrices and then prune visual tokens using the proposed strategy to maximize sparsity while retaining information. In particular, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM increases the efficiency of various VLMs in a number of image and video understanding tasks. For example, LLaVA equipped with SparseVLM achieves a 54% reduction in FLOPs and a 37% decrease in CUDA latency while maintaining 97% of its original accuracy. Our code is available at https://github.com/Gumpest/SparseVLMs.

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang • 2024
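
The text-guided scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy attention matrix, and the fixed `keep` budget are assumptions made for illustration. The abstract says the per-layer sparsification ratio is set adaptively by a rank-based strategy and that pruned tokens are recycled into compact representations; neither is reproduced here, and the choice of which text tokens act as raters is taken as given.

```python
def score_visual_tokens(attn, text_idx, visual_idx):
    """Score each visual token by the mean attention the rater text tokens pay to it.

    attn[q][k] is the attention weight from query token q to key token k.
    (Hypothetical helper; the averaging rule is an assumption.)
    """
    return [
        sum(attn[t][v] for t in text_idx) / len(text_idx)
        for v in visual_idx
    ]


def prune_visual_tokens(attn, text_idx, visual_idx, keep):
    """Keep the `keep` highest-scoring visual tokens; return (kept, pruned) indices.

    A fixed budget stands in for the paper's adaptive, rank-based ratio.
    """
    scores = score_visual_tokens(attn, text_idx, visual_idx)
    # Positions of the visual tokens, ordered from most to least attended.
    order = sorted(range(len(visual_idx)), key=lambda i: scores[i], reverse=True)
    kept = sorted(visual_idx[i] for i in order[:keep])
    pruned = sorted(visual_idx[i] for i in order[keep:])
    return kept, pruned


# Toy example: tokens 0-2 are visual, tokens 3-4 are text raters.
ATTN = [
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.5, 0.1, 0.2, 0.1, 0.1],  # text token 3 attends mostly to visual token 0
    [0.3, 0.1, 0.4, 0.1, 0.1],  # text token 4 attends mostly to visual token 2
]

kept, pruned = prune_visual_tokens(ATTN, text_idx=[3, 4], visual_idx=[0, 1, 2], keep=2)
print(kept, pruned)
```

In this toy case visual token 1 receives the least attention from both text tokens, so it is the one dropped. In a real VLM the attention matrix would come from a decoder layer rather than being written by hand.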

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 86.45	2019
Visual Question Answering	VizWiz	Accuracy 56.3	1820
Visual Question Answering	TextVQA	Accuracy 60.6	1453
Visual Question Answering	VQA v2	Accuracy 78.31	1429
Visual Question Answering	GQA	Accuracy 61.2	1425
Text-based Visual Question Answering	TextVQA	Accuracy 59.7	962
Robot Manipulation	LIBERO	Object Achievement 94.2	957
Multimodal Understanding	MMBench	Accuracy 65.7	847
Science Question Answering	ScienceQA	Accuracy 73.1	791
Multimodal Evaluation	MME	Score 2.06e+3	727

Showing 10 of 303 rows
