
Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

About

Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, consequently leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency without architectural changes. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which critically hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a dynamic token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (i.e., V²Drop), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks consistently demonstrate that V²Drop maintains 94.0% and 98.6% of the original performance for image and video understanding tasks respectively, while reducing LLM generation latency by 31.5% and 74.2%.
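The core idea described above, dropping visual tokens whose hidden states change least between consecutive LLM layers, can be illustrated with a short sketch. This is a hypothetical simplification, not the paper's implementation: the function name, the use of an L2 norm as the variation measure, and the single keep-ratio parameter are all assumptions for illustration.

```python
import numpy as np

def variation_drop(prev_hidden, curr_hidden, visual_idx, keep_ratio=0.5):
    """Hypothetical sketch of variation-aware visual token dropping.

    prev_hidden, curr_hidden: (seq_len, dim) hidden states from two
    consecutive LLM layers; visual_idx: positions of visual tokens.
    Returns the visual-token positions to retain.
    """
    visual_idx = np.asarray(visual_idx)
    # L2 variation of each visual token's representation across the two layers
    diff = curr_hidden[visual_idx] - prev_hidden[visual_idx]
    variation = np.linalg.norm(diff, axis=-1)
    num_keep = max(1, int(round(len(visual_idx) * keep_ratio)))
    # keep the tokens whose representations changed the most;
    # the minimal-variation tokens are dropped
    keep = np.argsort(variation)[-num_keep:]
    return np.sort(visual_idx[keep])
```

Applied progressively at several layers, such a rule shrinks the visual token sequence during generation, which is where the reported latency reductions come from.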

Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen • 2025

Related benchmarks

Task                               Dataset                                                Result          Rank
Visual Question Answering          GQA (test)                                             Accuracy 43.9   119
Visual Question Answering          VizWiz (test)                                          Accuracy 57.1   66
Object Hallucination Evaluation    POPE (test)                                            Accuracy 67.3   44
Multimodal Evaluation              MME (test)                                             --              32
Multimodal Understanding           Multimodal Evaluation Suite (GQA, MMBench,             GQA Score 55.1  21
                                   MMBench-CN, MME, POPE, SEED-Bench, TextVQA,
                                   VizWiz, OCRBench)
Text-based Visual QA               TextVQA (test)                                         --              10
Multimodal Question Answering      MMBench EN (test)                                      Accuracy 59.5   9
Multimodal Question Answering      MMBench CN (test)                                      Accuracy 59.1   9
OCR Evaluation                     OCRBench (test)                                        Score 9.7       9
