Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
About
Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, consequently leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency without architectural changes. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which critically hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a dynamic token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (*i.e.*, **V²Drop**), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks consistently demonstrate that V²Drop maintains **94.0%** and **98.6%** of the original performance for image and video understanding tasks respectively, while reducing LLM generation latency by **31.5%** and **74.2%**.
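The core idea can be illustrated with a minimal sketch: measure how much each visual token's hidden state changes across a transformer layer, and keep only the tokens with the largest variation. This is a hypothetical simplification, not the paper's implementation; the function name `variation_drop`, the L2-norm variation measure, and the `keep_ratio` parameter are all assumptions for illustration.

```python
import numpy as np

def variation_drop(prev_hidden, curr_hidden, keep_ratio):
    """Keep the visual tokens whose hidden states changed most across a layer.

    prev_hidden, curr_hidden: (num_tokens, dim) visual-token hidden states
    before and after one LLM layer. keep_ratio: fraction of tokens to keep.
    Returns the indices of kept tokens in their original order.
    """
    # Per-token variation: L2 norm of the change induced by the layer.
    variation = np.linalg.norm(curr_hidden - prev_hidden, axis=1)
    num_keep = max(1, int(round(keep_ratio * len(variation))))
    # Drop minimal-variation tokens by keeping the largest-variation ones.
    kept = np.argsort(-variation)[:num_keep]
    return np.sort(kept)

# Toy example: 6 visual tokens with 4-dim hidden states.
rng = np.random.default_rng(0)
prev = rng.standard_normal((6, 4))
delta = np.zeros((6, 4))
delta[[1, 4]] = 1.0  # only tokens 1 and 4 change noticeably across the layer
curr = prev + delta
print(variation_drop(prev, curr, keep_ratio=1 / 3))  # tokens 1 and 4 survive
```

Applied progressively, layer by layer, such a rule shrinks the visual token sequence during generation; because it only drops whole tokens (rather than reordering or merging them), it stays compatible with standard attention operators.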
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy: 85.1 | 1455 |
| Visual Question Answering | TextVQA | Accuracy: 55.6 | 1285 |
| Visual Question Answering | GQA (test) | Accuracy: 43.9 | 188 |
| Science Question Answering | ScienceQA (SQA-IMG) | Accuracy: 69.3 | 139 |
| Visual Question Answering | VizWiz (test) | Accuracy: 57.1 | 79 |
| Object Hallucination Evaluation | POPE (test) | Accuracy: 67.3 | 79 |
| Multimodal Evaluation | MME | MME Score: 1830 | 73 |
| Multimodal Question Answering | MMBench | Accuracy: 63.7 | 55 |
| Multimodal Evaluation | MME (test) | -- | 32 |
| Multimodal Question Answering | MMBench EN (test) | Accuracy: 59.5 | 26 |