Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

About

Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, consequently leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency without architectural changes. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which critically hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a dynamic token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (i.e., V²Drop), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks consistently demonstrate that V²Drop maintains 94.0% and 98.6% of the original performance for image and video understanding tasks respectively, while reducing LLM generation latency by 31.5% and 74.2%.
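
The abstract describes the core idea only at a high level: drop the visual tokens whose representations change least inside the LLM. A minimal sketch of one such dropping step is given below. It is illustrative only and is not the authors' implementation. The L2 distance used as the variation measure, the `keep_ratio` parameter, and the function names `token_variation` and `drop_low_variation_tokens` are all assumptions made for this example; the abstract does not specify them.

```python
import math


def token_variation(before, after):
    """Assumed variation measure: L2 distance between a token's
    hidden state before and after a layer (not specified in the abstract)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))


def drop_low_variation_tokens(before, after, is_visual, keep_ratio):
    """Return the sorted indices of the tokens to keep.

    Text tokens are always kept. Among visual tokens, only the
    ceil(keep_ratio * n_visual) tokens with the largest variation survive.
    """
    visual = [i for i, flag in enumerate(is_visual) if flag]
    n_keep = math.ceil(keep_ratio * len(visual))
    ranked = sorted(
        visual,
        key=lambda i: token_variation(before[i], after[i]),
        reverse=True,
    )
    kept_visual = set(ranked[:n_keep])
    return [
        i for i in range(len(is_visual))
        if not is_visual[i] or i in kept_visual
    ]


# Toy hidden states: four visual tokens followed by one text token.
before = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
after = [[0.0, 0.1], [1.0, 3.0], [2.0, 2.0], [3.5, 3.0], [9.0, 9.0]]
is_visual = [True, True, True, True, False]

kept = drop_low_variation_tokens(before, after, is_visual, keep_ratio=0.5)
print(kept)  # [1, 3, 4]
```

In this toy example, visual tokens 0 and 2 barely change and are dropped, while the text token at index 4 is kept regardless of its variation. A progressive scheme in the sense of the abstract would repeat a step like this at several layers, each time on the tokens that remain.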

Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy | 85.1 | 2019 |
| Visual Question Answering | TextVQA | Accuracy | 55.6 | 1453 |
| Visual Question Answering | VQA v2 | Accuracy | 74.9 | 1429 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 55.6 | 962 |
| Multimodal Understanding | MMBench | Accuracy | 63.7 | 847 |
| Science Question Answering | ScienceQA | Accuracy | 69.3 | 791 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 29.3 | 517 |
| Multimodal Understanding | SEED-Bench | Accuracy | 56.4 | 516 |
| Multimodal Understanding | MMBench CN | Accuracy | 56.6 | 254 |
| Visual Question Answering | GQA (test) | Accuracy | 43.9 | 197 |
Showing 10 of 38 rows
