Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

About

While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate Token Visual Dependency, which quantifies the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Having found that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be released at https://github.com/Yzk1114/PGPO.
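The abstract describes two steps: a per-token KL divergence between the visual-conditioned and text-only next-token distributions, and a threshold-gated, mass-conserving reshaping of advantages. The following is a minimal NumPy sketch of how those two steps might look. It is not the authors' implementation: the function names, the `threshold` and `boost` parameters, the direction of the KL divergence, and the specific normalization are all assumptions made for illustration, since the abstract does not give the exact formulas.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last (vocabulary) axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def token_visual_dependency(logits_with_image, logits_text_only):
    """Per-token KL(p_visual || p_text) over the vocabulary.

    Both inputs have shape (T, V): T generated tokens, V vocabulary entries.
    The KL direction is an assumption; the abstract only says "KL divergence
    between visual-conditioned and text-only predictive distributions".
    """
    p = softmax(logits_with_image)
    q = softmax(logits_text_only)
    eps = 1e-12  # avoids log(0) for vanishing probabilities
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)


def reshape_advantages(advantage, dependency, threshold, boost=2.0):
    """Hypothetical threshold-gated, mass-conserving reweighting.

    Tokens whose dependency exceeds `threshold` get a raw weight of `boost`,
    all others a raw weight of 1. The weights are then rescaled to sum to T,
    so the summed advantage over the sequence equals advantage * T, the same
    total as with uniform per-token advantages.
    """
    weights = np.where(dependency > threshold, boost, 1.0)
    weights = weights * len(weights) / weights.sum()
    return advantage * weights
```

Under these assumptions, rescaling the weights is what produces the suppression: whenever at least one token passes the gate, every token below the threshold ends up with a weight below 1, while the sequence-level advantage mass stays fixed.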

Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin • 2026

Related benchmarks

Task                                Dataset       Metric       Result   Rank
Mathematical Multimodal Reasoning   MathVerse     Accuracy     71.45    221
Multimodal Math Reasoning           MathVision    Accuracy     29.02    183
Logical reasoning                   LogicVista    Accuracy     47.93    84
Mathematical Reasoning              DynaMath      Accuracy     57.71    75
Geometric Reasoning                 Geometry3K    Accuracy@1   45.2     42
Multimodal Mathematical Reasoning   MathVerse-V   Accuracy     66.41    33
Multi-modal Reasoning               MMMU-Pro      Accuracy     39.01    28
Multimodal Math Reasoning           MMK12         Accuracy     80.83    24