
AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

About

Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate the substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention-based or diversity-based pruning methods, the characteristics and limitations of these approaches have not been analyzed in depth. In this work, we conduct a thorough empirical analysis using effective rank (erank) as a measure of feature diversity and attention score entropy to investigate visual token processing mechanisms and analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is closely tied to increased hallucination frequency compared to attention-based pruning. (2) We further observe that attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination-specific evaluations. Our project page is available at https://cvsp-lab.github.io/AgilePruner.
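
The two diagnostics named in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of effective rank (the exponential of the Shannon entropy of the normalized singular values of a token-feature matrix) and plain Shannon entropy over normalized attention scores. The function names `effective_rank` and `attention_entropy`, the NumPy implementation, and the input shapes are illustrative assumptions, not the authors' code; the paper may compute these quantities differently (for example, on a particular layer or with extra preprocessing).

```python
import numpy as np


def effective_rank(features: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank of a (num_tokens, dim) feature matrix.

    Computed as exp(H(p)), where p is the singular-value spectrum
    normalized to sum to 1 and H is the Shannon entropy in nats.
    """
    singular_values = np.linalg.svd(features, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop numerically zero entries so log(p) is defined
    return float(np.exp(-(p * np.log(p)).sum()))


def attention_entropy(scores: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy (nats) of non-negative attention scores over visual tokens."""
    p = scores / (scores.sum() + eps)
    p = p[p > eps]
    return float(-(p * np.log(p)).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(64, 32))      # hypothetical visual token features
    attn = rng.random(64)                   # hypothetical attention scores
    print("erank:", effective_rank(tokens))
    print("attention entropy:", attention_entropy(attn))
```

Under these definitions, a higher effective rank means the retained token features span more directions, which is the sense of "feature diversity" used above. A lower attention entropy means attention is concentrated on a few tokens, as the abstract describes for simple images.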

Changwoo Baek, Jouwon Song, Sohyeon Kim, Kyeongbo Kong • 2026

Related benchmarks

Task | Dataset | Result | Rank
--- | --- | --- | ---
Object Hallucination Evaluation | POPE | Accuracy 87.4 | 2019
Visual Question Answering | VizWiz | Accuracy 56 | 1820
Visual Question Answering | TextVQA | Accuracy 57 | 1453
Visual Question Answering | VQA v2 | Accuracy 76.4 | 1429
Multimodal Evaluation | MME | Score 1.75e+3 | 727
Diagram Question Answering | AI2D | -- | 387
Chart Question Answering | ChartQA | -- | 371
Multimodal Perception and Cognition | MME | Overall Score 1.50e+3 | 270
Scientific Question Answering | ScienceQA image | Accuracy 69 | 259
Visual Question Answering | GQA | Mean Accuracy 59.4 | 196
Showing 10 of 20 rows
