
Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance

About

Large Vision-Language Models (LVLMs) have recently demonstrated strong multimodal understanding, yet their fine-grained visual perception is often constrained by low input resolutions. A common remedy is to partition high-resolution images into multiple sub-images for separate encoding, but this approach drastically inflates the number of visual tokens and introduces prohibitive inference overhead. To overcome this challenge, we propose Pyramid Token Pruning (PTP), a training-free strategy that hierarchically integrates bottom-up visual saliency at both region and token levels with top-down instruction-guided relevance. Inspired by human visual cognition, PTP selectively preserves more tokens from salient regions while further emphasizing those most relevant to task instructions. Extensive experiments on 13 diverse benchmarks show that PTP substantially reduces computational cost, memory usage, and inference latency, with negligible performance degradation.
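The hierarchy described above — bottom-up saliency at the region and token levels, combined with top-down instruction relevance — can be sketched in a few lines. The snippet below is an illustrative approximation, not the paper's implementation: it stands in for the paper's saliency cues with per-token norms and uses cosine similarity to a text embedding for instruction relevance; the function name, budget allocation, and scoring weights are all assumptions.

```python
import numpy as np

def pyramid_token_prune(sub_image_tokens, text_embedding, keep_ratio=0.25):
    """Hypothetical sketch of pyramid-style token pruning.

    sub_image_tokens: list of (n_i, d) arrays, one per high-resolution sub-image.
    text_embedding:   (d,) instruction embedding used for top-down relevance.
    keep_ratio:       overall fraction of visual tokens to retain.
    """
    # Bottom-up region saliency: mean token norm per sub-image (a stand-in
    # for the saliency signal the paper derives from the vision encoder).
    region_saliency = np.array(
        [np.linalg.norm(t, axis=1).mean() for t in sub_image_tokens]
    )
    # Allocate each region's token budget proportionally to its saliency,
    # so salient regions keep more tokens.
    total_budget = int(keep_ratio * sum(len(t) for t in sub_image_tokens))
    budgets = np.maximum(
        1, np.round(total_budget * region_saliency / region_saliency.sum())
    ).astype(int)

    kept = []
    for tokens, budget in zip(sub_image_tokens, budgets):
        # Token-level bottom-up saliency: per-token norm.
        saliency = np.linalg.norm(tokens, axis=1)
        # Top-down relevance: cosine similarity to the instruction embedding.
        relevance = tokens @ text_embedding / (
            saliency * np.linalg.norm(text_embedding) + 1e-8
        )
        # Combine both cues and keep the highest-scoring tokens,
        # preserving their original spatial order.
        score = saliency / saliency.max() + relevance
        top = np.argsort(score)[-min(budget, len(tokens)):]
        kept.append(tokens[np.sort(top)])
    return kept
```

Because every step is a scoring pass followed by a top-k selection, the procedure needs no training or fine-tuning, which is what makes this style of pruning drop-in for an existing LVLM.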

Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue • 2025

Related benchmarks

Task                                  Dataset        Metric       Result  Rank
Visual Question Answering             VizWiz         Accuracy     33.1    1043
Object Hallucination Evaluation       POPE           Accuracy     87.2    935
Visual Question Answering             GQA            Accuracy     63.1    374
Visual Question Answering             AI2D           Accuracy     83.8    174
Multimodal Evaluation                 MM-Vet         --           --      122
Multimodal Evaluation                 SEED-Bench     Accuracy     75.1    80
Comprehensive Multi-modal Evaluation  MME            Total Score  2220    73
Multimodal Benchmark                  MMBench (MMB)  Accuracy     81.5    70
Multimodal Evaluation                 MMStar         Accuracy     61.4    46
Real-world Visual Understanding       RealworldQA    Accuracy     65.3    24

Showing 10 of 13 rows.
