Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models

About

Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By retaining enough tokens to preserve a target proportion of spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields average improvements of up to 0.6%, including a significant +5.1% relative boost on the MM-Vet reasoning task. With randomized singular value decomposition, the additional latency is limited to 8 ms per image.

Jialuo He, Huangxun Chen • 2026
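For intuition, below is a minimal sketch of the energy-driven budget rule described in the abstract. The threshold `tau`, the rank cap `max_rank`, and the norm-based token selection in `prune_tokens` are illustrative assumptions, not the paper's method: the abstract only states that the budget is set by preserving a proportion of singular-value spectral energy, computed with randomized SVD.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd


def adaptive_token_budget(X: np.ndarray, tau: float = 0.9, max_rank: int = 128) -> int:
    """Smallest k whose top-k singular values retain a `tau` fraction of the
    spectral energy of the visual feature matrix X (N tokens x d dims)."""
    # Randomized SVD keeps the overhead low (the paper reports ~8 ms/image).
    n_components = min(max_rank, min(X.shape))
    _, s, _ = randomized_svd(X, n_components=n_components, random_state=0)
    # Total spectral energy equals the squared Frobenius norm of X, so the
    # cumulative ratio is exact even though the spectrum is truncated.
    cum_energy = np.cumsum(s ** 2) / np.sum(X ** 2)
    k = int(np.searchsorted(cum_energy, tau) + 1)
    return min(k, len(s))  # clamp in case tau exceeds the truncated spectrum


def prune_tokens(X: np.ndarray, tau: float = 0.9) -> np.ndarray:
    """Keep the k most salient tokens. L2 norm is a stand-in salience score
    here; the paper's actual selection rule is not given in the abstract."""
    k = adaptive_token_budget(X, tau)
    keep = np.argsort(-np.linalg.norm(X, axis=1))[:k]
    return X[np.sort(keep)]  # preserve the original token order
```

The adaptivity falls out of the spectrum shape: an information-dense image has a flat spectrum and needs many components to reach the energy threshold, so it keeps a large budget, while a redundant scene with a fast-decaying spectrum is compressed aggressively.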

Related benchmarks

Task                                  Dataset     Metric            Result  Rank
Object Hallucination Evaluation       POPE        Accuracy          86.5    1455
Text-based Visual Question Answering  TextVQA     Accuracy          58.1    807
Multimodal Evaluation                 MME         Score             1510    658
Multimodal Reasoning                  MM-Vet      MM-Vet Score      33.5    431
Multimodal Benchmarking               MMBench CN  Score             59.1    129
Visual Question Answering             GQA         Accuracy          60.9    29
Multimodal Evaluation                 SEED-Bench  SEED-Bench Score  65.2    28
Multimodal Benchmarking               MMBench     MMBench Score     65.2    13
