GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

About

In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete selection space. Integrated with a budget-aware scorer, our lightweight agent dynamically evaluates per-token importance and adapts to arbitrary compression ratios without retraining. Extensive experiments across diverse multimodal benchmarks demonstrate that GRIP-VLM consistently outperforms heuristic and supervised-learning baselines, achieving a superior Pareto frontier and delivering up to a 15% inference speedup at equal accuracy.

Mingzhe Huang, Weijun Wang, Xin Ding, Liang Mi, Hao Wen, Yuanchun Li, Lichen Pang, Shansong Yang, Yunxin Liu, Ting Cao • 2026
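
The abstract names two mechanisms that a small example may make more concrete: selecting tokens under a compression budget, and the group-relative advantage used in GRPO-style training. The sketch below is illustrative only and is not the authors' implementation. The function names `keep_top_k` and `group_relative_advantages`, the top-k selection rule, and the reward values are assumptions introduced here; the abstract does not specify the scorer, the sampling scheme, or the reward.

```python
from statistics import mean, pstdev


def keep_top_k(scores, keep_ratio):
    """Return the sorted indices of tokens kept under a budget.

    Hypothetical helper: assumes the budget is a fraction of tokens to keep
    and that higher scores mean more important tokens.
    """
    n = len(scores)
    # Keep at least one token and never more than n.
    k = min(n, max(1, round(keep_ratio * n)))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    # Sort the kept indices so the original token order is preserved.
    return sorted(ranked[:k])


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled rollouts.

    This is the group-relative baseline used in GRPO-style methods: each
    rollout is scored against the mean and standard deviation of its own
    group, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four visual tokens with made-up importance scores, 50% keep budget.
    print(keep_top_k([0.1, 0.9, 0.4, 0.7], keep_ratio=0.5))
    # Made-up rewards for four sampled pruning decisions on the same input.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because the budget enters only through `keep_ratio`, the same scores can be reused at different compression ratios, which is one simple way a scorer could serve several budgets without retraining.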

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 87.05	2056
Science Question Answering	ScienceQA	Accuracy 71.15	916
Diagram Question Answering	AI2D	AI2D Accuracy 55.57	509
Multimodal Perception and Cognition	MME	Overall Score 1.79e+3	344
Optical Character Recognition Evaluation	OCRBench	Score 31.7	91
Visual Question Answering on Text	TextVQA	Accuracy 57.86	41
Aggregate Benchmark Evaluation	Overall Performance Avg.	Overall Performance Ratio 100.2	31
Comprehensive Multimodal Benchmarking	MMBench MMB-EN_en	Accuracy 65.89	31
Visual Question Answering	GQA	Accuracy 61.79	31
Visual Question Answering	GQA 40 (test)	TTFT (s) 36.4	25

