VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

About

Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-arts on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91$\times$ speedup in prefilling and a 10$\times$ reduction in FLOPs, while retaining 95.4\% of the original performance. Code is available at https://github.com/Tencent/SelfEvolvingAgent/tree/main/VScan.

Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Haitao Mi, Dong Yu• 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy85	2019
Visual Question Answering	VQA v2	Accuracy75.4	1429
Text-based Visual Question Answering	TextVQA	Accuracy55.6	962
Visual Question Answering	GQA	Accuracy58.3	524
Multimodal Understanding	MMBench CN	Accuracy55.7	254
Visual Grounding	RefCOCO+ (val)	Accuracy79	253
Visual Grounding	RefCOCO+ (testA)	Accuracy84.6	245
Visual Grounding	RefCOCO+ (testB)	Accuracy70.6	219
Science Question Answering	ScienceQA SQA-IMG	Accuracy69.1	186
Visual Grounding	RefCOCO (val)	Accuracy86.7	172

Showing 10 of 22 rows

Other info

Follow for update

@wizwand_team Discord