Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

About

Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, consequently leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency without architectural changes. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which critically hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a dynamic token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (i.e., V²Drop), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks consistently demonstrate that V²Drop maintains 94.0% and 98.6% of the original performance for image and video understanding tasks respectively, while reducing LLM generation latency by 31.5% and 74.2%.
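
The core idea in the abstract, dropping the visual tokens whose hidden states change least, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how variation is measured or how many tokens are kept at each stage, so the L2 distance between a token's hidden state before and after a layer, the `keep_ratio` parameter, and both function names are assumptions made for illustration only.

```python
import math


def token_variation(prev_states, curr_states):
    """Per-token L2 distance between hidden states before and after a layer.

    Assumption: L2 distance is used as the variation measure; the abstract
    does not specify the metric.
    """
    return [
        math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))
        for prev, curr in zip(prev_states, curr_states)
    ]


def drop_low_variation_tokens(prev_states, curr_states, is_visual, keep_ratio):
    """Return the sorted indices of the tokens that survive one dropping step.

    Text tokens are always kept. Among visual tokens, only the fraction
    `keep_ratio` (hypothetical parameter) with the largest variation is kept.
    """
    variation = token_variation(prev_states, curr_states)
    visual_idx = [i for i, flag in enumerate(is_visual) if flag]
    n_keep = max(1, math.ceil(len(visual_idx) * keep_ratio))
    # Rank visual tokens by variation, largest first.
    ranked = sorted(visual_idx, key=lambda i: variation[i], reverse=True)
    kept_visual = set(ranked[:n_keep])
    return [
        i for i in range(len(is_visual))
        if not is_visual[i] or i in kept_visual
    ]


# Toy example: three visual tokens followed by one text token.
prev = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
curr = [[0.0, 0.1], [2.0, 1.0], [2.0, 2.0], [3.5, 3.0]]
kept = drop_low_variation_tokens(prev, curr, [True, True, True, False], 0.5)
print(kept)  # [0, 1, 3]: the unchanged visual token at index 2 is dropped
```

Calling such a step at several depths with the reduced token set would give the progressive behavior the abstract describes. Because the selection here depends only on hidden states and not on attention maps, a sketch of this kind does not need access to attention weights.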

Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen • 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 85.1	2019
Visual Question Answering	TextVQA	Accuracy 55.6	1453
Visual Question Answering	VQA v2	Accuracy 74.9	1429
Text-based Visual Question Answering	TextVQA	Accuracy 55.6	962
Multimodal Understanding	MMBench	Accuracy 63.7	847
Science Question Answering	ScienceQA	Accuracy 69.3	791
Multimodal Reasoning	MM-Vet	MM-Vet Score 29.3	517
Multimodal Understanding	SEED-Bench	Accuracy 56.4	516
Multimodal Understanding	MMBench CN	Accuracy 56.6	254
Visual Question Answering	GQA (test)	Accuracy 43.9	197

Showing 10 of 38 rows
