Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models
About
Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a target proportion of the spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MM-Vet reasoning task. Using randomized singular value decomposition, the additional latency is limited to 8 ms per image.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 86.5 | 1455 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 58.1 | 807 |
| Multimodal Evaluation | MME | Score | 1510 | 658 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 33.5 | 431 |
| Multimodal Benchmarking | MMBench CN | Score | 59.1 | 129 |
| Visual Question Answering | GQA | Accuracy | 60.9 | 29 |
| Multimodal Evaluation | SEED-Bench | SEED-Bench Score | 65.2 | 28 |
| Multimodal Benchmarking | MMBench | MMBench Score | 65.2 | 13 |