
All You Need Are Random Visual Tokens? Demystifying Token Pruning in VLLMs

About

Vision Large Language Models (VLLMs) incur high computational costs because they rely on hundreds of visual tokens to represent images. While token pruning offers a promising route to faster inference, this paper identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by "vanishing token information": visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token's information content by measuring the change in the model's output probabilities upon its removal. Using this metric, our layer-wise analysis of visual-token information reveals three key findings: (1) as layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term the "information horizon", beyond which visual tokens become redundant; (2) the position of this horizon is not static: it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), than for more general tasks like Visual Question Answering (VQA); (3) the horizon is also strongly correlated with model capacity: stronger VLLMs (e.g., Qwen2.5-VL) exploit visual tokens at deeper layers than weaker models (e.g., LLaVA-1.5). Based on these findings, we show that simple random pruning in deep layers efficiently balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods: combining DivPrune with random pruning achieves state-of-the-art results, retaining 96.9% of Qwen2.5-VL-7B performance while pruning 50% of visual tokens. The code will be publicly available at https://github.com/YahongWang1/Information-Horizon.
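The two core ideas in the abstract can be sketched in a few lines: scoring a visual token by how much the model's output distribution shifts when that token is removed, and randomly keeping a fraction of visual tokens at a deep layer. This is a minimal illustration, not the authors' implementation; the choice of total-variation distance as the probability-change measure and the function names are assumptions.

```python
import numpy as np

def token_information(probs_full, probs_without_token):
    """Information content of one visual token, sketched as the total
    variation between the model's output distributions with and without
    that token. The paper measures "the change in output probabilities";
    the specific divergence used here is an assumption."""
    return 0.5 * np.abs(np.asarray(probs_full) - np.asarray(probs_without_token)).sum()

def random_prune(visual_tokens, keep_ratio, seed=None):
    """Randomly keep a fraction of visual tokens (rows of a [N, d] array),
    preserving their original order -- the simple deep-layer strategy the
    abstract advocates beyond the "information horizon"."""
    rng = np.random.default_rng(seed)
    n = visual_tokens.shape[0]
    keep = max(1, int(round(n * keep_ratio)))
    idx = np.sort(rng.choice(n, size=keep, replace=False))
    return visual_tokens[idx], idx

# Example: prune 50% of 576 visual tokens (a common LLaVA-style count).
tokens = np.random.default_rng(0).normal(size=(576, 16))
pruned, kept_idx = random_prune(tokens, keep_ratio=0.5, seed=0)
print(pruned.shape)  # (288, 16)
```

Identical output distributions give an information score of 0, matching the intuition that a token whose removal changes nothing carries no usable information at that layer.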

Yahong Wang, Juncheng Wu, Zhangkai Ni, Longzhen Yang, Yihang Liu, Chengmei Yang, Ying Wen, Xianfeng Tang, Hui Liu, Yuyin Zhou, Lianghua He • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Question Answering | TextVQA | Accuracy | 82.8 | 1117
Visual Question Answering | GQA | Accuracy | 60.1 | 963
Object Hallucination Evaluation | POPE | Accuracy | 87.1 | 935
Multimodal Evaluation | MME | Score | 1850 | 557
Multimodal Understanding | MMBench | Accuracy | 79.9 | 367
Science Question Answering | ScienceQA | Accuracy | 70.3 | 229
Science Question Answering | ScienceQA (SQA) | Accuracy | 70.6 | 128
Optical Character Recognition Benchmarking | OCRBench | Accuracy | 83.3 | 109
Visual Question Answering | DocVQA | Accuracy | 92.9 | 103
Multimodal Benchmarking | MMBench CN | Score | 57.7 | 73
Showing 10 of 14 rows
