
When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs

About

Vision Large Language Models (VLLMs) incur high computational costs due to their reliance on hundreds of visual tokens to represent images. While token pruning offers a promising solution for accelerating inference, this paper identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by "vanishing token information", where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token's information content by measuring the change in the model's output probabilities upon its removal. Using this proposed metric, our analysis of visual token information across layers reveals three key findings: (1) As layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term the "information horizon", beyond which the visual tokens become redundant; (2) The position of this horizon is not static; it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), compared to more general tasks like Visual Question Answering (VQA); (3) This horizon is also strongly correlated with model capacity, as stronger VLLMs (e.g., Qwen2.5-VL) make use of visual tokens at deeper layers than weaker models (e.g., LLaVA-1.5). Based on these findings, we show that simple random pruning in deep layers efficiently balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods. Combining DivPrune with random pruning achieves state-of-the-art results, maintaining 96.9% of Qwen2.5-VL-7B performance while pruning 50% of visual tokens. The code is available at https://github.com/YahongWang1/Information-Horizon.
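The abstract's metric measures a visual token's information as the change in the model's output probabilities when that token is removed. The paper does not spell out the distance function here, so the sketch below is a minimal, hypothetical illustration: it uses total-variation distance between the output distributions with and without each token, and a toy softmax stand-in (`toy_predict`) in place of a real VLLM forward pass. The function names are illustrative, not the paper's API.

```python
import numpy as np

def token_information(predict, tokens):
    """Per-token information: change in the model's output distribution
    when that token is dropped, measured as total-variation distance.

    `predict` maps a list of token embeddings to a probability vector;
    here it stands in for a full VLLM forward pass (an assumption --
    the paper's exact distance function is not specified on this page).
    """
    p_full = predict(tokens)
    info = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]  # remove token i
        p_reduced = predict(reduced)
        # TV distance: half the L1 gap between the two distributions
        info.append(0.5 * np.abs(p_full - p_reduced).sum())
    return np.array(info)

def toy_predict(tokens):
    """Toy stand-in model: softmax over the sum of token embeddings."""
    logits = np.sum(tokens, axis=0)
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
tokens = list(rng.normal(size=(6, 4)))  # 6 "visual tokens", dim 4
info = token_information(toy_predict, tokens)
```

Under this definition, a token whose removal leaves the output distribution unchanged carries zero information; the paper's "information horizon" is the depth past which this holds (approximately) for all visual tokens, which is why random pruning suffices there.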

Yahong Wang, Juncheng Wu, Zhangkai Ni, Longzhen Yang, Yihang Liu, Chengmei Yang, Ying Wen, Lianghua He, Xianfeng Tang, Hui Liu, Yuyin Zhou • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Object Hallucination Evaluation | POPE | Accuracy | 87.1 | 1455
Visual Question Answering | TextVQA | Accuracy | 82.8 | 1285
Visual Question Answering | GQA | Accuracy | 60.1 | 1249
Multimodal Evaluation | MME | Score | 1850 | 658
Multimodal Understanding | MMBench | Accuracy | 79.9 | 637
Science Question Answering | ScienceQA | Accuracy | 70.3 | 502
Science Question Answering | ScienceQA (SQA) | Accuracy | 70.6 | 273
Visual Question Answering | DocVQA | Accuracy | 92.9 | 162
Visual Question Answering | InfoVQA | Accuracy | 74.9 | 135
Optical Character Recognition Benchmarking | OCRBench | Accuracy | 83.3 | 131

(Showing 10 of 14 rows.)
