Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
About
While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks share a foundational methodological flaw: by assigning the same advantage to every generated token, they dilute the learning signal for the critical, visually grounded steps of multimodal reasoning. To address this, we formulate Token Visual Dependency, which quantifies the causal information gain of visual inputs as the Kullback-Leibler (KL) divergence between the visual-conditioned and text-only predictive distributions. Having found that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO amplifies learning signals for visually dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be released at https://github.com/Yzk1114/PGPO.
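The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the released PGPO code: the abstract does not state the exact formulas, so the function names, the `threshold` and `boost` parameters, and the choice to rescale the weights so they average to one (one possible reading of "mass-conserving") are all assumptions made for illustration.

```python
import numpy as np


def _log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def token_visual_dependency(logits_with_image, logits_text_only):
    """Per-token KL(p_visual || p_text) for two [T, V] logit arrays.

    Assumption: both arrays score the same T generated tokens, once with
    the image in context and once with the image removed.
    """
    logp_v = _log_softmax(logits_with_image)
    logp_t = _log_softmax(logits_text_only)
    return (np.exp(logp_v) * (logp_v - logp_t)).sum(axis=-1)


def reshape_advantages(advantage, dependency, threshold=0.1, boost=2.0):
    """Hypothetical threshold-gated, mass-conserving reweighting.

    Tokens whose dependency exceeds `threshold` get weight `boost`; the
    weights are then rescaled to sum to T, so the total advantage matches
    the uniform case and ungated tokens are pushed below 1.
    """
    weights = np.where(dependency > threshold, boost, 1.0)
    weights = weights * (weights.size / weights.sum())
    return advantage * weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    with_image = rng.normal(size=(5, 7))
    text_only = with_image.copy()
    text_only[2] = rng.normal(size=7)  # only token 2 depends on the image
    dep = token_visual_dependency(with_image, text_only)
    print(reshape_advantages(1.0, dep, threshold=1e-6, boost=3.0))
```

Under these assumptions, a sequence in which no token passes the gate keeps the uniform advantage unchanged, since every weight is rescaled back to one.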
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Multimodal Reasoning | MathVerse | Accuracy | 71.45 | 221 |
| Multimodal Math Reasoning | MathVision | Accuracy | 29.02 | 183 |
| Logical reasoning | LogicVista | Accuracy | 47.93 | 84 |
| Mathematical Reasoning | DynaMath | Accuracy | 57.71 | 75 |
| Geometric Reasoning | Geometry3K | Accuracy@1 | 45.2 | 42 |
| Multimodal Mathematical Reasoning | MathVerse-V | Accuracy | 66.41 | 33 |
| Multi-modal Reasoning | MMMU-Pro | Accuracy | 39.01 | 28 |
| Multimodal Math Reasoning | MMK12 | Accuracy | 80.83 | 24 |