Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

About

Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, consequently leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency without architectural changes. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which critically hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a dynamic token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (i.e., V²Drop), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks consistently demonstrate that V²Drop maintains 94.0% and 98.6% of the original performance for image and video understanding tasks respectively, while reducing LLM generation latency by 31.5% and 74.2%.
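The core idea — scoring each visual token by how much its hidden state changes between consecutive LLM layers and dropping the lowest-variation tokens — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the L2-norm scoring rule, and the fixed keep ratio are all assumptions for exposition.

```python
import numpy as np

def variation_drop(prev_hidden: np.ndarray, curr_hidden: np.ndarray,
                   keep_ratio: float = 0.5):
    """Keep only the visual tokens whose hidden states changed the most
    between two consecutive layers.

    prev_hidden, curr_hidden: (num_tokens, dim) hidden states of the
    same visual tokens at adjacent layers (shapes are an assumption).
    Returns the retained hidden states and their original indices.
    """
    # Per-token variation: L2 norm of the layer-to-layer difference
    # (one plausible variation measure; the paper may use another).
    variation = np.linalg.norm(curr_hidden - prev_hidden, axis=-1)
    n_keep = max(1, int(len(variation) * keep_ratio))
    # Indices of the n_keep highest-variation tokens, restored to
    # their original sequence order to preserve positional structure.
    keep_idx = np.sort(np.argsort(variation)[-n_keep:])
    return curr_hidden[keep_idx], keep_idx
```

Applied progressively across layers, each call shrinks the visual token set, which is what reduces the per-layer attention and FFN cost during generation.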

Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy | 85.1 | 1455 |
| Visual Question Answering | TextVQA | Accuracy | 55.6 | 1285 |
| Visual Question Answering | GQA (test) | Accuracy | 43.9 | 188 |
| Science Question Answering | ScienceQA SQA-IMG | Accuracy | 69.3 | 139 |
| Visual Question Answering | VizWiz (test) | Accuracy | 57.1 | 79 |
| Object Hallucination Evaluation | POPE (test) | Accuracy | 67.3 | 79 |
| Multimodal Evaluation | MME | MME Score | 1830 | 73 |
| Multimodal Question Answering | MMBench | Accuracy | 63.7 | 55 |
| Multimodal Evaluation | MME (test) | -- | -- | 32 |
| Multimodal Question Answering | MMBench EN (test) | Accuracy | 59.5 | 26 |

Showing 10 of 14 rows.
