
ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models

About

Large Vision-Language Models (VLMs) enable strong multimodal reasoning but incur heavy inference costs from redundant visual tokens. Token pruning alleviates this issue, yet existing approaches face limitations. Attention-based methods rely on raw attention scores, which are often unstable across layers and heads and can lead to redundant selections. Diversity-based methods improve robustness by selecting tokens far apart in feature space, but risk dropping regions needed for accurate prediction. We propose ZOO-Prune, a training-free framework built on the intuition that highly sensitive tokens have a stronger influence on the model's output and capture complementary visual cues rather than redundant ones. To achieve this, we estimate token sensitivity using zeroth-order perturbations at the lightweight projection layer. This measures how small random perturbations affect the projected features and enables efficient approximation of each token's influence without backpropagation. Extensive experiments across multiple VLMs and benchmarks show that ZOO-Prune consistently outperforms prior methods while pruning up to 94.4% of tokens without sacrificing accuracy. Our method also improves efficiency, reaching up to 2.30x faster end-to-end inference compared to the baseline.
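The sensitivity idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact scoring or selection rule, so the central-difference estimator, the top-k selection, and every name here (`zo_token_sensitivity`, `prune_tokens`, `project`, `num_dirs`, `step`) are assumptions made for illustration. The `project` callable is a hypothetical stand-in for the VLM's projection layer.

```python
import numpy as np

def zo_token_sensitivity(tokens, project, num_dirs=8, step=1e-2, seed=0):
    """Estimate per-token sensitivity using forward passes only.

    tokens:  (N, d_in) array of visual token features (hypothetical input).
    project: callable mapping (N, d_in) -> (N, d_out), applied row-wise;
             a stand-in for the projection layer, not a real library API.
    """
    rng = np.random.default_rng(seed)
    scores = np.zeros(tokens.shape[0])
    for _ in range(num_dirs):
        # Draw one random unit direction per token.
        u = rng.standard_normal(tokens.shape)
        u /= np.linalg.norm(u, axis=1, keepdims=True)
        # Central finite difference: two forward passes, no backpropagation.
        delta = project(tokens + step * u) - project(tokens - step * u)
        scores += np.linalg.norm(delta, axis=1) / (2.0 * step)
    return scores / num_dirs

def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens in their original order."""
    top = np.argsort(scores)[::-1][:keep]
    idx = np.sort(top)
    return tokens[idx], idx

# Toy usage with a random projector (assumed shapes, not from the paper).
rng = np.random.default_rng(1)
W = rng.standard_normal((16, 32))
tokens = rng.standard_normal((576, 16))
scores = zo_token_sensitivity(tokens, lambda x: np.tanh(x @ W))
kept, kept_idx = prune_tokens(tokens, scores, keep=32)
```

In this toy setup, keeping 32 of 576 tokens removes 544 of them, which is about 94.4% of the total. Averaging over several random directions reduces the variance of the estimate at the cost of two extra forward passes through the projector per direction.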

Youngeun Kim, Youjia Zhang, Huiling Liu, Aecheon Jung, Sunwoo Lee, Sungeun Hong • 2025

Related benchmarks

Task | Dataset | Result | Rank
Object Hallucination Evaluation | POPE | -- | 2019
Visual Question Answering | VQA v2 | Accuracy 79.64 | 1429
Visual Question Answering | GQA | -- | 1425
Text-based Visual Question Answering | TextVQA | Accuracy 57.98 | 962
Multimodal Understanding | MMBench | Accuracy 62.89 | 847
Science Question Answering | ScienceQA | Accuracy 69.16 | 791
Multimodal Evaluation | MME | -- | 727
Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy 77.34 | 712
Multi-discipline Multimodal Understanding | MMMU | Accuracy 37.11 | 363
Text-based Visual Question Answering | TextVQA (val) | Accuracy 57.87 | 276
Showing 10 of 24 rows
