Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
About
Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the growing demand for high-resolution image and long-video understanding produces very large token counts, which reduces inference efficiency. Token compression offers a direct solution: it reduces the number of tokens to be processed and so improves computational efficiency without architectural changes. Through extensive analysis, we identify two critical limitations of existing inner-LLM token compression methods, positional bias and incompatibility with efficient operators, both of which hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a dynamic token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (i.e., V²Drop), which progressively removes the visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks consistently demonstrate that V²Drop maintains 94.0% and 98.6% of the original performance on image and video understanding tasks, respectively, while reducing LLM generation latency by 31.5% and 74.2%.
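The dropping step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how variation is measured, so the sketch assumes it is the L2 norm of the change in a token's hidden state between two consecutive LLM layers. The function name `drop_low_variation_tokens` and the parameters `prev_hidden`, `curr_hidden`, `is_visual`, and `keep_ratio` are hypothetical names chosen for illustration.

```python
import numpy as np


def drop_low_variation_tokens(prev_hidden, curr_hidden, is_visual, keep_ratio):
    """Keep all text tokens and the visual tokens that changed the most.

    prev_hidden, curr_hidden: (num_tokens, dim) hidden states at two
        consecutive layers (assumed inputs, not taken from the paper).
    is_visual: boolean array of shape (num_tokens,), True for visual tokens.
    keep_ratio: fraction of visual tokens to retain at this layer.
    Returns the retained hidden states and their original indices.
    """
    # Assumed variation measure: L2 norm of the per-token change across layers.
    variation = np.linalg.norm(curr_hidden - prev_hidden, axis=-1)

    visual_idx = np.flatnonzero(is_visual)
    n_keep = max(1, int(round(keep_ratio * visual_idx.size)))

    # Rank visual tokens from largest to smallest variation.
    order = np.argsort(-variation[visual_idx], kind="stable")
    kept_visual = visual_idx[order[:n_keep]]

    # Text tokens are never dropped; kept indices stay in sequence order.
    keep_mask = ~is_visual
    keep_mask[kept_visual] = True
    kept = np.flatnonzero(keep_mask)
    return curr_hidden[kept], kept


# Toy usage: token 0 is text, tokens 1 to 4 are visual.
prev = np.zeros((5, 2))
curr = np.array([[1.0, 0.0], [0.1, 0.0], [3.0, 0.0], [0.2, 0.0], [5.0, 0.0]])
mask = np.array([False, True, True, True, True])
kept_states, kept_indices = drop_low_variation_tokens(prev, curr, mask, keep_ratio=0.5)
```

Applying such a step at several layers with a shrinking `keep_ratio` would give a progressive schedule. Because the selection uses only hidden states and not attention weights, a step of this kind does not require access to attention maps, which is one way to stay compatible with fused attention kernels.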
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 85.1 | 2019 |
| Visual Question Answering | TextVQA | Accuracy | 55.6 | 1453 |
| Visual Question Answering | VQA v2 | Accuracy | 74.9 | 1429 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 55.6 | 962 |
| Multimodal Understanding | MMBench | Accuracy | 63.7 | 847 |
| Science Question Answering | ScienceQA | Accuracy | 69.3 | 791 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 29.3 | 517 |
| Multimodal Understanding | SEED-Bench | Accuracy | 56.4 | 516 |
| Multimodal Understanding | MMBench CN | Accuracy | 56.6 | 254 |
| Visual Question Answering | GQA (test) | Accuracy | 43.9 | 197 |