
Rényi Entropy: A New Token Pruning Metric for Vision Transformers

About

Vision Transformers (ViTs) achieve state-of-the-art performance but suffer from the $O(N^2)$ complexity of self-attention, making inference costly for high-resolution inputs. To address this bottleneck, token pruning has emerged as a critical technique for accelerating inference. Most existing methods rely on the [CLS] token to estimate patch importance. However, we argue that the [CLS] token can be unreliable in early layers, where semantic representations are still immature. As a result, early-layer pruning often leads to inaccurate importance estimation and unnecessary information loss. In this work, we propose a training-free token importance metric, Col-Ln, derived from Rényi entropy, which identifies informative tokens from the first layer of the network and thereby enables more reliable token reduction. Extensive experiments on ViTs and Large Vision-Language Models (LVLMs) demonstrate that our approach consistently outperforms state-of-the-art pruning methods across diverse benchmarks.
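The abstract does not spell out how Col-Ln is computed, but the general recipe — score each token by the Rényi entropy of an attention-derived distribution, then keep the top-ranked tokens — can be sketched as follows. The column-wise scoring, the order `alpha=2`, and the helper names here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def renyi_entropy(p, alpha=2.0, eps=1e-12):
    """Rényi entropy H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha).

    Recovers Shannon entropy in the limit alpha -> 1. `p` is normalized
    to a probability distribution before the entropy is computed.
    """
    p = p / (p.sum() + eps)
    if abs(alpha - 1.0) < 1e-6:
        return float(-(p * np.log(p + eps)).sum())
    return float(np.log((p ** alpha).sum() + eps) / (1.0 - alpha))

def token_importance(attn, alpha=2.0):
    """Hypothetical column-wise score: treat the attention token j receives
    from all queries (column j of `attn`) as a distribution; a peaked
    (low-entropy) column suggests the token is selectively attended to,
    so we use negative entropy as an importance score."""
    return np.array([-renyi_entropy(attn[:, j], alpha)
                     for j in range(attn.shape[1])])

def prune_tokens(tokens, attn, keep_ratio=0.5, alpha=2.0):
    """Keep the top `keep_ratio` fraction of tokens by importance,
    preserving their original sequence order."""
    scores = token_importance(attn, alpha)
    k = max(1, int(round(len(tokens) * keep_ratio)))
    keep = np.sort(np.argsort(scores)[-k:])
    return tokens[keep], keep
```

A quick usage example with a random attention map: for an 8-token sequence and `keep_ratio=0.5`, `prune_tokens` returns the 4 highest-scoring tokens along with their (sorted) indices, so downstream layers see a shorter sequence in the original order.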

Wei-Yuan Su, Ruijie Zhang, Zheng Zhang • 2026

Related benchmarks

Task | Dataset | Result | Rank
--- | --- | --- | ---
Image Classification | Stanford Cars | -- | 635
Image Classification | EuroSAT | -- | 569
Image Classification | DTD | -- | 542
Image Classification | UCF101 | Top-1 Acc 69.34 | 455
Image Classification | SUN397 | Accuracy 63.14 | 441
Multimodal Understanding | LLaVA Evaluation Suite 1.5 | Average Score 63.2 | 95
Image Classification | Oxford Pets | Top-1 Acc 85.99 | 94
Image Classification | Oxford Flowers | Top-1 Accuracy 70.85 | 83
Image Classification | ImageNet OOD | ImageNet Acc 65.31 | 68
Image Classification | Aircraft | Top-1 Acc 22.98 | 57

(10 of 12 rows shown)
