HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models

About

Vision-Language Models (VLMs) encode images into lengthy sequences of visual tokens, leading to excessive computational overhead and limited inference efficiency. While prior efforts prune or merge tokens to address this issue, they often rely on special tokens (e.g., CLS) or require task-specific training, hindering scalability across architectures. In this paper, we propose HiPrune, a training-free and model-agnostic token Pruning framework that exploits the Hierarchical attention structure within vision encoders. We identify that middle layers attend to object-centric regions, while deep layers capture global contextual features. Based on this observation, HiPrune selects three types of informative tokens: (1) Anchor tokens with high attention in object-centric layers, (2) Buffer tokens adjacent to anchors for spatial continuity, and (3) Register tokens with strong attention in deep layers for global summarization. Our method requires no retraining and integrates seamlessly with any ViT-based VLM. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL demonstrate that HiPrune achieves state-of-the-art pruning performance, preserving up to 99.3% task accuracy with only 33.3% of the tokens, and maintaining 99.5% accuracy with just 11.1% of the tokens. Meanwhile, it reduces inference FLOPs and latency by up to 9×, showcasing strong generalization across models and tasks. Code is available at https://github.com/Danielement321/HiPrune.
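
The three token types described above can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it only mirrors the abstract's description under several assumptions: each visual token already has a single attention score from a middle layer and from a deep layer, tokens lie on a square grid in row-major order, "adjacent" means the four direct neighbours, and the names `select_tokens`, `n_anchor`, and `budget` are hypothetical.

```python
def select_tokens(mid_attn, deep_attn, grid, n_anchor, budget):
    """Return sorted indices of visual tokens to keep (illustrative only).

    mid_attn, deep_attn: per-token attention scores, length grid * grid
    (how these scores are aggregated over heads is an assumption).
    n_anchor: number of anchor tokens; budget: total tokens to keep.
    """
    n = grid * grid
    assert len(mid_attn) == n and len(deep_attn) == n
    n_anchor = min(n_anchor, budget)

    # (1) Anchor tokens: highest attention in the middle (object-centric) layer.
    by_mid = sorted(range(n), key=lambda i: -mid_attn[i])
    anchors = by_mid[:n_anchor]
    keep = set(anchors)

    # (2) Buffer tokens: 4-neighbours of each anchor, for spatial continuity.
    for a in anchors:
        r, c = divmod(a, grid)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < grid and 0 <= cc < grid and len(keep) < budget:
                keep.add(rr * grid + cc)

    # (3) Register tokens: fill the remaining budget by deep-layer attention.
    for i in sorted(range(n), key=lambda i: -deep_attn[i]):
        if len(keep) >= budget:
            break
        keep.add(i)

    return sorted(keep)


# Toy 4x4 grid: one salient object token (index 5), one global token (index 15).
mid = [0.0] * 16
mid[5] = 1.0
deep = [0.0] * 16
deep[15] = 1.0
print(select_tokens(mid, deep, grid=4, n_anchor=1, budget=6))
# -> [1, 4, 5, 6, 9, 15]
```

In this toy case the anchor (5) pulls in its four neighbours as buffers, and the last slot goes to the token with the strongest deep-layer attention. For the retention ratios quoted in the abstract, `budget` would be roughly one third or one ninth of the visual token count.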

Jizhihui Liu, Feiyi Du, Guangdao Zhu, Niu Lian, Jun Li, Bin Chen • 2025

Related benchmarks

Task                                  Dataset   Result               Rank
Visual Question Answering             VQA v2    Accuracy 69.2        1165
Object Hallucination Evaluation       POPE      Accuracy 73          935
Multimodal Evaluation                 MME       Score 1.62e+3        557
Text-based Visual Question Answering  TextVQA   Accuracy 54.9        496
Visual Question Answering             GQA       Accuracy 53.6        374
Multimodal Understanding              MMBench   --                   367
Visual Question Answering             ChartQA   --                   239
Chart Question Answering              ChartQA   --                   229
Diagram Question Answering            AI2D      AI2D Accuracy 70.27  196
Video Understanding                   VideoMME  Overall Score 84.41  192
Showing 10 of 25 rows
