
HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models

About

Vision-Language Models (VLMs) encode images into lengthy sequences of visual tokens, leading to excessive computational overhead and limited inference efficiency. While prior efforts prune or merge tokens to address this issue, they often rely on special tokens (e.g., CLS) or require task-specific training, hindering scalability across architectures. In this paper, we propose HiPrune, a training-free and model-agnostic token Pruning framework that exploits the Hierarchical attention structure within vision encoders. We identify that middle layers attend to object-centric regions, while deep layers capture global contextual features. Based on this observation, HiPrune selects three types of informative tokens: (1) Anchor tokens with high attention in object-centric layers, (2) Buffer tokens adjacent to anchors for spatial continuity, and (3) Register tokens with strong attention in deep layers for global summarization. Our method requires no retraining and integrates seamlessly with any ViT-based VLM. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL demonstrate that HiPrune achieves state-of-the-art pruning performance, preserving up to 99.3% task accuracy with only 33.3% tokens, and maintaining 99.5% accuracy with just 11.1% tokens. Meanwhile, it reduces inference FLOPs and latency by up to 9$\times$, showcasing strong generalization across models and tasks. Code is available at https://github.com/Danielement321/HiPrune.
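
The three-way selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes one attention score per patch token is already available from a middle layer and from a deep layer, that "adjacent" means the 4-neighbourhood on the patch grid, and that registers fill whatever budget remains. The function name `select_tokens` and all of its parameters are hypothetical.

```python
def select_tokens(mid_attn, deep_attn, grid_w, budget, num_anchors):
    """Illustrative sketch only: return sorted indices of tokens to keep.

    mid_attn, deep_attn: one score per patch token (row-major grid), assumed
    to come from an object-centric middle layer and a deep layer.
    grid_w: patch-grid width; budget: total tokens to keep;
    num_anchors: how many anchors to pick (all hypothetical parameters).
    """
    n = len(mid_attn)
    grid_h = n // grid_w
    num_anchors = min(num_anchors, budget)

    # (1) Anchor tokens: highest attention in the middle layer.
    anchors = sorted(range(n), key=lambda i: mid_attn[i], reverse=True)[:num_anchors]
    keep = set(anchors)

    # (2) Buffer tokens: grid neighbours of each anchor (assumed 4-neighbourhood).
    for a in anchors:
        r, c = divmod(a, grid_w)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < grid_h and 0 <= cc < grid_w and len(keep) < budget:
                keep.add(rr * grid_w + cc)

    # (3) Register tokens: fill the remaining budget by deep-layer attention.
    for i in sorted(range(n), key=lambda i: deep_attn[i], reverse=True):
        if len(keep) >= budget:
            break
        keep.add(i)

    return sorted(keep)


# Toy 4x4 grid: one object-centric peak at index 5, one global peak at index 15.
mid = [0.0] * 16
mid[5] = 1.0
deep = [0.0] * 16
deep[15] = 1.0
print(select_tokens(mid, deep, grid_w=4, budget=6, num_anchors=1))
# -> [1, 4, 5, 6, 9, 15]
```

In the toy run, token 5 is the anchor, tokens 1, 4, 6, and 9 are its buffers, and token 15 is the single register that fills the remaining budget. How the paper actually chooses the layers, scores, and the split between token types is not stated in the abstract.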

Jizhihui Liu, Feiyi Du, Guangdao Zhu, Niu Lian, Jun Li, Bin Chen • 2025

Related benchmarks

Task                                  | Dataset | Result              | Rank
Object Hallucination Evaluation       | POPE    | Accuracy 73         | 1455
Visual Question Answering             | VQA v2  | Accuracy 69.2       | 1362
Text-based Visual Question Answering  | TextVQA | Accuracy 54.9       | 807
Multimodal Evaluation                 | MME     | Score 1.62e+3       | 658
Multimodal Understanding              | MMBench | --                  | 637
Visual Question Answering             | GQA     | Accuracy 53.6       | 505
Visual Question Answering             | ChartQA | --                  | 371
Chart Question Answering              | ChartQA | --                  | 356
Document Visual Question Answering    | DocVQA  | ANLS 73.52          | 263
Diagram Question Answering            | AI2D    | AI2D Accuracy 70.27 | 232
Showing 10 of 28 rows
