
Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance

About

Large Vision-Language Models (LVLMs) have recently demonstrated strong multimodal understanding, yet their fine-grained visual perception is often constrained by low input resolutions. A common remedy is to partition high-resolution images into multiple sub-images for separate encoding, but this approach drastically inflates the number of visual tokens and introduces prohibitive inference overhead. To overcome this challenge, we propose Pyramid Token Pruning (PTP), a training-free strategy that hierarchically integrates bottom-up visual saliency at both region and token levels with top-down instruction-guided relevance. Inspired by human visual cognition, PTP selectively preserves more tokens from salient regions while further emphasizing those most relevant to task instructions. Extensive experiments on 13 diverse benchmarks show that PTP substantially reduces computational cost, memory usage, and inference latency, with negligible performance degradation.
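The hierarchy described above — bottom-up saliency at the region and token levels, combined with top-down instruction relevance — can be sketched in a few lines. The snippet below is an illustrative approximation, not the paper's implementation: it stands in for the paper's saliency cues with per-token norms and uses cosine similarity to a text embedding for instruction relevance; the function name, budget allocation, and scoring weights are all assumptions.

```python
import numpy as np

def pyramid_token_prune(sub_image_tokens, text_embedding, keep_ratio=0.25):
    """Hypothetical sketch of pyramid-style token pruning.

    sub_image_tokens: list of (n_i, d) arrays, one per high-resolution sub-image.
    text_embedding:   (d,) instruction embedding used for top-down relevance.
    keep_ratio:       overall fraction of visual tokens to retain.
    """
    # Bottom-up region saliency: mean token norm per sub-image (a stand-in
    # for the saliency signal the paper derives from the vision encoder).
    region_saliency = np.array(
        [np.linalg.norm(t, axis=1).mean() for t in sub_image_tokens]
    )
    # Allocate each region's token budget proportionally to its saliency,
    # so salient regions keep more tokens.
    total_budget = int(keep_ratio * sum(len(t) for t in sub_image_tokens))
    budgets = np.maximum(
        1, np.round(total_budget * region_saliency / region_saliency.sum())
    ).astype(int)

    kept = []
    for tokens, budget in zip(sub_image_tokens, budgets):
        # Token-level bottom-up saliency: per-token norm.
        saliency = np.linalg.norm(tokens, axis=1)
        # Top-down relevance: cosine similarity to the instruction embedding.
        relevance = tokens @ text_embedding / (
            saliency * np.linalg.norm(text_embedding) + 1e-8
        )
        # Combine both cues and keep the highest-scoring tokens,
        # preserving their original spatial order.
        score = saliency / saliency.max() + relevance
        top = np.argsort(score)[-min(budget, len(tokens)):]
        kept.append(tokens[np.sort(top)])
    return kept
```

Because every step is a scoring pass followed by a top-k selection, the procedure needs no training or fine-tuning, which is what makes this style of pruning drop-in for an existing LVLM.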

Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue • 2025

Related benchmarks

Task                                  Dataset        Metric       Result  Rank
Visual Question Answering             VizWiz         Accuracy     33.1    1043
Object Hallucination Evaluation       POPE           Accuracy     87.2    935
Visual Question Answering             GQA            Accuracy     63.1    374
Visual Question Answering             AI2D           Accuracy     83.8    174
Multimodal Evaluation                 MM-Vet         --           --      122
Multimodal Evaluation                 SEED-Bench     Accuracy     75.1    80
Comprehensive Multi-modal Evaluation  MME            Total Score  2220    73
Multimodal Benchmark                  MMBench (MMB)  Accuracy     81.5    70
Multimodal Evaluation                 MMStar         Accuracy     61.4    46
Real-world Visual Understanding       RealworldQA    Accuracy     65.3    24

Showing 10 of 13 rows.
