Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
About
Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the growing demand for high-resolution image and long-video understanding produces substantial token counts, which in turn reduces inference efficiency. Token compression offers a direct solution: it reduces the number of tokens to be processed, improving computational efficiency without architectural changes. Through extensive analysis, we identify two critical limitations of existing inner-LLM token compression methods, positional bias and incompatibility with efficient operators, which hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a dynamic token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (*i.e.*, **V²Drop**), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks consistently demonstrate that V²Drop maintains **94.0%** and **98.6%** of the original performance on image and video understanding tasks respectively, while reducing LLM generation latency by **31.5%** and **74.2%**.
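The core idea, dropping visual tokens whose hidden states change least between LLM layers, can be sketched as follows. This is an illustrative sketch only: the function name, the L2-norm variation metric, and the `keep_ratio` parameter are assumptions for exposition, not the paper's exact formulation or schedule.

```python
import math

def v2drop_keep_indices(prev_hidden, curr_hidden, keep_ratio=0.5):
    """Rank visual tokens by how much their hidden states varied
    between two consecutive LLM layers, and keep the most-varying ones.
    (Hypothetical sketch; V2Drop's exact metric may differ.)

    prev_hidden, curr_hidden: lists of per-token hidden-state vectors.
    Returns the kept token indices in their original order, so the
    surviving tokens are not reshuffled positionally.
    """
    # Per-token variation: L2 norm of the layer-to-layer difference.
    variation = [
        math.sqrt(sum((c - p) ** 2 for p, c in zip(pv, cv)))
        for pv, cv in zip(prev_hidden, curr_hidden)
    ]
    n_keep = max(1, round(keep_ratio * len(variation)))
    # Sort token indices by ascending variation; keep the top n_keep.
    order = sorted(range(len(variation)), key=lambda i: variation[i])
    return sorted(order[-n_keep:])

# Toy usage: 8 visual tokens with hidden size 2; only tokens 1, 4 and 6
# change between layers, so they survive the drop.
prev = [[0.0, 0.0] for _ in range(8)]
curr = [row[:] for row in prev]
for i in (1, 4, 6):
    curr[i] = [1.0, 1.0]
print(v2drop_keep_indices(prev, curr, keep_ratio=0.5))  # → [1, 4, 6, 7]
```

Applying such a filter progressively at several layers, rather than once, is what lets the token count shrink as inference proceeds.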
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | GQA (test) | Accuracy | 43.9 | 119 |
| Visual Question Answering | VizWiz (test) | Accuracy | 57.1 | 66 |
| Object Hallucination Evaluation | POPE (test) | Accuracy | 67.3 | 44 |
| Multi-modal Evaluation | MME (test) | -- | -- | 32 |
| Multimodal Understanding | Multimodal Evaluation Suite (GQA, MMBench, MMBench-CN, MME, POPE, SEED-Bench, TextVQA, VizWiz, OCRBench) | GQA Score | 55.1 | 21 |
| Text-based Visual Question Answering | TextVQA (test) | -- | -- | 10 |
| Multimodal Question Answering | MMBench EN (test) | Accuracy | 59.5 | 9 |
| Multimodal Question Answering | MMBench CN (test) | Accuracy | 59.1 | 9 |
| OCR Evaluation | OCRBench (test) | Score | 9.7 | 9 |