Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models

About

Visual token pruning reduces the computational cost of Vision-Language Models (VLMs) by removing redundant visual tokens. Existing methods typically rely on Gumbel-Softmax to approximate discrete selection during training. However, the optimization is driven by surrogate gradients rather than the true selection process, leading to unreliable learning of token importance. In this paper, we propose DiffPrune, which reformulates pruning as continuous control of token information instead of discrete selection learning. Specifically, we introduce an Information Throttler that modulates each token using variance-preserving noise conditioned on importance scores, where higher scores induce less information suppression during training. This design directly operates on token representations, naturally providing a fully differentiable optimization path for learning token importance. At inference, tokens are removed via hard thresholding on the learned scores. Across ten VLM benchmarks, DiffPrune retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85x, with only 0.69 ms of inference overhead.
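The abstract does not give the exact form of the Information Throttler, so the following is only a minimal sketch of the idea under stated assumptions. It assumes scores lie in [0, 1], token features have roughly unit variance, and the modulation is the blend `sqrt(s) * x + sqrt(1 - s) * eps` with standard Gaussian noise `eps`. That blend is one common variance-preserving form, not necessarily the one DiffPrune uses. The names `throttle` and `prune` and the 0.5 threshold are hypothetical.

```python
import math
import random


def throttle(token, score, rng):
    """Training-time modulation of one token (assumed form, not from the paper).

    Computes sqrt(score) * x + sqrt(1 - score) * eps per feature, where eps is
    standard Gaussian noise. A score of 1 passes the token through unchanged;
    a score of 0 replaces it with pure noise. For unit-variance features the
    output variance stays at 1, since score + (1 - score) = 1.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    keep = math.sqrt(score)
    noise = math.sqrt(1.0 - score)
    return [keep * x + noise * rng.gauss(0.0, 1.0) for x in token]


def prune(tokens, scores, threshold=0.5):
    """Inference-time hard thresholding: keep tokens scoring above threshold.

    The threshold value here is a placeholder, not a value from the paper.
    """
    return [t for t, s in zip(tokens, scores) if s > threshold]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [[0.5, -1.2, 0.3], [1.1, 0.4, -0.7], [-0.2, 0.9, 1.5]]
    scores = [0.9, 0.2, 0.7]

    # Training: every token is kept, but low-score tokens are mostly noise.
    noisy = [throttle(t, s, rng) for t, s in zip(tokens, scores)]
    print(noisy)

    # Inference: tokens below the threshold are dropped outright.
    print(prune(tokens, scores))
```

In this sketch the throttled output is a smooth function of the score, which is why an autodiff framework could pass gradients from the task loss back to the scores without a surrogate estimator. The pure-Python lists stand in for tensors only to keep the example dependency-free.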

Landi He, Mingde Yao, Shawn Young, Lijian Xu • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | -- | 2019 |
| Multimodal Understanding | MMBench CN | -- | 254 |
| Multimodal Evaluation | MME | Total Score: 1.72e+3 | 23 |
| Multimodal Understanding and Question Answering | LLaVA 7B Evaluation Suite (GQA, MMBench, MMBench-CN, MME, POPE, ScienceQA, VQAv2, TextVQA, SEED-Bench, VizWiz) 1.5 | GQA Accuracy: 57.8 | 22 |
| Science Question Answering | ScienceQA | SQA Score: 72.7 | 19 |
| Vision-Language Multi-task Evaluation | Qwen2.5-VL Evaluation Suite MMB, MME, POPE, SQA, VQAText (test) | MMB Score: 81.7 | 10 |
| Visual Question Answering | GQA | GQA Score: 62.3 | 7 |
| Multimodal Understanding | MMBench | MMB Score: 64.7 | 7 |
| Visual Question Answering | VQAv2 | VQAv2 Score: 78.6 | 7 |
| Visual Question Answering | TextVQA | VQAText Score: 56.7 | 7 |