GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
About
In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete selection space. Integrated with a budget-aware scorer, our lightweight agent dynamically evaluates per-token importance and adapts to arbitrary compression ratios without retraining. Extensive experiments across diverse multimodal benchmarks demonstrate that GRIP-VLM consistently outperforms heuristic and supervised-learning baselines, achieving a superior Pareto frontier and delivering up to a 15% inference speedup at equal accuracy.
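To make the two ideas named in the abstract concrete, the following is a minimal sketch of (a) the group-relative advantage that GRPO-style methods commonly use, where each sampled pruning decision is scored against the mean and standard deviation of its own group, and (b) budget-constrained selection of the highest-scoring tokens. It is not the authors' implementation: the function names, the `eps` stabilizer, and the `keep_ratio` parameter are assumptions introduced here for illustration, and the reward values are placeholders.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled pruning masks.

    Assumption: each reward is a scalar score for one sampled mask,
    and the whole group was sampled for the same input.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


def keep_top_k(scores, keep_ratio):
    """Return the indices of the tokens kept under a budget, in original order.

    Assumption: `scores` holds one importance score per visual token and
    `keep_ratio` (hypothetical name) is the fraction of tokens to retain.
    """
    k = max(1, int(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Three sampled masks with placeholder rewards: the best one gets a
# positive advantage, the worst a negative one.
advantages = group_relative_advantages([1.0, 2.0, 3.0])

# Keep half of four tokens: the two highest-scoring indices survive.
kept = keep_top_k([0.1, 0.9, 0.5, 0.7], keep_ratio=0.5)
```

Because the budget enters only through `keep_ratio` at selection time, a scorer of this shape could in principle be reused at different compression ratios, which is the property the abstract attributes to the budget-aware scorer.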
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 87.05 | 2019 |
| Science Question Answering | ScienceQA | Accuracy | 71.15 | 791 |
| Diagram Question Answering | AI2D | AI2D Accuracy | 55.57 | 387 |
| Multimodal Perception and Cognition | MME | Overall Score | 1.79e+3 | 270 |
| Optical Character Recognition Evaluation | OCRBench | Score | 31.7 | 91 |
| Visual Question Answering on Text | TextVQA | Accuracy | 57.86 | 41 |
| Aggregate Benchmark Evaluation | Overall Performance Avg. | Overall Performance Ratio | 100.2 | 31 |
| Comprehensive Multimodal Benchmarking | MMBench MMB-EN_en | Accuracy | 65.89 | 31 |
| Visual Question Answering | GQA | Accuracy | 61.79 | 31 |
| Visual Question Answering | GQA 40 (test) | TTFT (s) | 36.4 | 25 |