Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models
About
Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a target proportion of the spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MM-Vet reasoning task. Using randomized singular value decomposition, the additional latency is limited to 8 ms per image.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 86.5 | 1455 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 58.1 | 807 |
| Multimodal Evaluation | MME | Score | 1510 | 658 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 33.5 | 431 |
| Multimodal Benchmarking | MMBench CN | Score | 59.1 | 129 |
| Visual Question Answering | GQA | Accuracy | 60.9 | 29 |
| Multimodal Evaluation | SEED-Bench | SEED-Bench Score | 65.2 | 28 |
| Multimodal Benchmarking | MMBench | MMBench Score | 65.2 | 13 |