SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
About
In vision-language models (VLMs), visual tokens usually account for a significant share of the computational overhead, even though they carry sparser information than text tokens. To address this, most existing methods train a network on dedicated data to prune redundant visual tokens. In contrast, we propose a text-guided, training-free token optimization mechanism dubbed SparseVLM that eliminates the need for extra parameters or fine-tuning costs. Given that visual tokens complement text tokens in a VLM's linguistic reasoning, we select relevant text tokens to rate the significance of visual tokens using self-attention matrices, and then prune visual tokens using the proposed strategy to maximize sparsity while retaining information. In particular, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM increases the efficiency of various VLMs on a number of image and video understanding tasks. For example, LLaVA equipped with SparseVLM achieves a 54% reduction in FLOPs and a 37% decrease in CUDA latency while maintaining 97% of its original accuracy. Our code is available at https://github.com/Gumpest/SparseVLMs.
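The text-guided scoring and pruning idea described above can be illustrated with a minimal sketch. This is not the released SparseVLM implementation: the function names, the fixed `keep` budget (standing in for the rank-based adaptive ratio), and the score-weighted merge (standing in for the token recycling method) are all simplifying assumptions made for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def score_visual_tokens(text_to_visual_attn: List[List[float]]) -> List[float]:
    """Score each visual token by the mean attention it receives.

    text_to_visual_attn[t][v] is the attention weight from selected text
    token t to visual token v (hypothetical input layout).
    """
    num_text = len(text_to_visual_attn)
    num_visual = len(text_to_visual_attn[0])
    return [
        sum(row[v] for row in text_to_visual_attn) / num_text
        for v in range(num_visual)
    ]


def prune_and_recycle(
    tokens: List[Vector], scores: List[float], keep: int
) -> Tuple[List[Vector], List[int]]:
    """Keep the `keep` highest-scoring tokens and merge the rest into one.

    Assumption: a fixed budget replaces the adaptive per-layer ratio, and a
    single score-weighted average replaces the paper's recycling step.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # preserve original token order
    pruned_idx = order[keep:]
    out = [tokens[i] for i in kept_idx]
    if pruned_idx:
        dim = len(tokens[0])
        weight = sum(scores[i] for i in pruned_idx)
        if weight > 0:
            merged = [
                sum(scores[i] * tokens[i][d] for i in pruned_idx) / weight
                for d in range(dim)
            ]
        else:
            # All pruned tokens scored zero: fall back to a plain average.
            merged = [
                sum(tokens[i][d] for i in pruned_idx) / len(pruned_idx)
                for d in range(dim)
            ]
        out.append(merged)
    return out, kept_idx


if __name__ == "__main__":
    # Two selected text tokens attending to four visual tokens (toy values).
    attn = [
        [0.125, 0.5, 0.0, 0.25],
        [0.375, 0.25, 0.0, 0.125],
    ]
    visual = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0], [2.0, 2.0]]
    scores = score_visual_tokens(attn)
    compact, kept = prune_and_recycle(visual, scores, keep=2)
    print(scores, kept, compact)
```

In this toy setup the four visual tokens shrink to two kept tokens plus one merged token, so later layers would process three visual tokens instead of four.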
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 77.1 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 57.7 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 51.4 | 1043 |
| Visual Question Answering | GQA | Accuracy | 58.9 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 84.9 | 935 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 75.6 | 664 |
| Multimodal Evaluation | MME | Score | 2.06e+3 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 58.4 | 496 |
| Robot Manipulation | LIBERO | Goal Achievement | 97.6 | 494 |
| Video Question Answering | MSRVTT-QA | Accuracy | 31 | 481 |