
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

About

In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete selection space. Integrated with a budget-aware scorer, our lightweight agent dynamically evaluates per-token importance and adapts to arbitrary compression ratios without retraining. Extensive experiments across diverse multimodal benchmarks demonstrate that GRIP-VLM consistently outperforms heuristic and supervised-learning baselines, achieving a superior Pareto frontier and delivering up to a 15% inference speedup at equal accuracy.
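
The abstract names two mechanisms: a group-relative learning signal in the GRPO style, and per-token scoring under a compression budget. The sketch below only illustrates those two ideas and is not the authors' implementation. The function names, the standardization with `eps`, and the top-k selection rule are all assumptions made for illustration, since the abstract does not specify how rewards are normalized or how the budget is enforced.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    Assumption: rewards come from several pruning masks sampled for the
    same input, and the group mean serves as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_top_tokens(scores, keep_ratio):
    """Return the indices of the highest-scoring tokens, in original order.

    Assumption: the budget is a fraction of tokens to keep, enforced by
    plain top-k over hypothetical per-token importance scores.
    """
    k = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Three sampled masks for one input, each with a scalar reward.
    print(group_relative_advantages([1.0, 2.0, 3.0]))
    # Four visual tokens, keep half of them.
    print(keep_top_tokens([0.1, 0.9, 0.5, 0.7], keep_ratio=0.5))
```

Because `keep_ratio` is an argument rather than a trained constant, the same scores can be reused at a different budget, which is the kind of retraining-free adaptation the abstract describes.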

Mingzhe Huang, Weijun Wang, Xin Ding, Liang Mi, Hao Wen, Yuanchun Li, Lichen Pang, Shansong Yang, Yunxin Liu, Ting Cao • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 87.05 | 2019 |
| Science Question Answering | ScienceQA | Accuracy | 71.15 | 791 |
| Diagram Question Answering | AI2D | AI2D Accuracy | 55.57 | 387 |
| Multimodal Perception and Cognition | MME | Overall Score | 1.79e+3 | 270 |
| Optical Character Recognition Evaluation | OCRBench | Score | 31.7 | 91 |
| Visual Question Answering on Text | TextVQA | Accuracy | 57.86 | 41 |
| Aggregate Benchmark Evaluation | Overall Performance Avg. | Overall Performance Ratio | 100.2 | 31 |
| Comprehensive Multimodal Benchmarking | MMBench MMB-EN_en | Accuracy | 65.89 | 31 |
| Visual Question Answering | GQA | Accuracy | 61.79 | 31 |
| Visual Question Answering | GQA 40 (test) | TTFT (s) | 36.4 | 25 |