
All You Need Are Random Visual Tokens? Demystifying Token Pruning in VLLMs

About

Vision Large Language Models (VLLMs) incur high computational costs because they rely on hundreds of visual tokens to represent images. While token pruning offers a promising route to faster inference, this paper identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by "vanishing token information": visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token's information content by measuring the change in the model's output probabilities upon its removal. Using this metric, our layer-wise analysis of visual-token information reveals three key findings: (1) as layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term the "information horizon", beyond which visual tokens become redundant; (2) the position of this horizon is not static: it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), than for more general tasks like Visual Question Answering (VQA); (3) the horizon is also strongly correlated with model capacity: stronger VLLMs (e.g., Qwen2.5-VL) exploit visual tokens at deeper layers than weaker models (e.g., LLaVA-1.5). Based on these findings, we show that simple random pruning in deep layers efficiently balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods: combining DivPrune with random pruning achieves state-of-the-art results, retaining 96.9% of Qwen2.5-VL-7B performance while pruning 50% of visual tokens. The code will be publicly available at https://github.com/YahongWang1/Information-Horizon.
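The two core ideas in the abstract can be sketched in a few lines: scoring a visual token by how much the model's output distribution shifts when that token is removed, and randomly keeping a fraction of visual tokens at a deep layer. This is a minimal illustration, not the authors' implementation; the choice of total-variation distance as the probability-change measure and the function names are assumptions.

```python
import numpy as np

def token_information(probs_full, probs_without_token):
    """Information content of one visual token, sketched as the total
    variation between the model's output distributions with and without
    that token. The paper measures "the change in output probabilities";
    the specific divergence used here is an assumption."""
    return 0.5 * np.abs(np.asarray(probs_full) - np.asarray(probs_without_token)).sum()

def random_prune(visual_tokens, keep_ratio, seed=None):
    """Randomly keep a fraction of visual tokens (rows of a [N, d] array),
    preserving their original order -- the simple deep-layer strategy the
    abstract advocates beyond the "information horizon"."""
    rng = np.random.default_rng(seed)
    n = visual_tokens.shape[0]
    keep = max(1, int(round(n * keep_ratio)))
    idx = np.sort(rng.choice(n, size=keep, replace=False))
    return visual_tokens[idx], idx

# Example: prune 50% of 576 visual tokens (a common LLaVA-style count).
tokens = np.random.default_rng(0).normal(size=(576, 16))
pruned, kept_idx = random_prune(tokens, keep_ratio=0.5, seed=0)
print(pruned.shape)  # (288, 16)
```

Identical output distributions give an information score of 0, matching the intuition that a token whose removal changes nothing carries no usable information at that layer.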

Yahong Wang, Juncheng Wu, Zhangkai Ni, Longzhen Yang, Yihang Liu, Chengmei Yang, Ying Wen, Xianfeng Tang, Hui Liu, Yuyin Zhou, Lianghua He • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Question Answering | TextVQA | Accuracy | 82.8 | 1117
Visual Question Answering | GQA | Accuracy | 60.1 | 963
Object Hallucination Evaluation | POPE | Accuracy | 87.1 | 935
Multimodal Evaluation | MME | Score | 1850 | 557
Multimodal Understanding | MMBench | Accuracy | 79.9 | 367
Science Question Answering | ScienceQA | Accuracy | 70.3 | 229
Science Question Answering | ScienceQA (SQA) | Accuracy | 70.6 | 128
Optical Character Recognition Benchmarking | OCRBench | Accuracy | 83.3 | 109
Visual Question Answering | DocVQA | Accuracy | 92.9 | 103
Multimodal Benchmarking | MMBench CN | Score | 57.7 | 73
Showing 10 of 14 rows
