PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
About
Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that a fine-grained, model-intrinsic signal, one that rewards growth in the model's confidence in the ground-truth answer, improves language reasoning training by providing step-level guidance without costly external models. While this is effective for unimodal text, we find that naively applying the same global reward to vision-language (V-L) reasoning is suboptimal, because the task is a heterogeneous mix of sparse visual perception steps and dense textual reasoning steps. Normalizing the reward globally across all steps creates mixture-induced signal degradation, where the training signal for visual steps is statistically distorted by the predominant textual steps. We propose Perception-Decomposed Confidence Reward (PDCR), a framework that addresses this by aligning the reward structure with the task's heterogeneous nature. PDCR first performs an unsupervised skill decomposition: it introduces a model-internal Visual Dependence Score to quantify each step's visual reliance and applies a clustering algorithm to separate perception steps from reasoning steps. Based on this split, PDCR computes a decomposed advantage by normalizing confidence gains within each skill cluster. This intra-cluster normalization provides a stable, correctly scaled signal for both perception and reasoning. We demonstrate that PDCR outperforms the naive global-reward formulation and sparse-reward baselines on key V-L reasoning benchmarks.
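
The decomposed advantage described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the Visual Dependence Score is computed or which clustering algorithm is used, so the sketch assumes that per-step scores and confidence gains are already available as arrays, uses a hypothetical 1-D two-means split as the clustering step, and uses z-score normalization within each cluster. The function names `two_means_1d`, `decomposed_advantage`, and `global_advantage` are illustrative only.

```python
import numpy as np


def two_means_1d(scores, iters=50):
    """Split 1-D scores into two clusters with a small k-means (k=2).

    Assumption: a stand-in for the paper's unspecified clustering step.
    Returns a boolean mask where True marks the high-score cluster.
    """
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if lo == hi:
        # All scores identical: no meaningful split, put everything in one cluster.
        return np.zeros(scores.shape, dtype=bool)
    for _ in range(iters):
        # Assign each step to the nearer of the two centroids.
        mask = np.abs(scores - hi) < np.abs(scores - lo)
        new_lo, new_hi = scores[~mask].mean(), scores[mask].mean()
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    return mask


def global_advantage(conf_gain, eps=1e-8):
    """Baseline: normalize confidence gains over all steps at once."""
    g = np.asarray(conf_gain, dtype=float)
    return (g - g.mean()) / (g.std() + eps)


def decomposed_advantage(conf_gain, visual_dep, eps=1e-8):
    """Normalize confidence gains separately inside each skill cluster.

    conf_gain:  per-step confidence gain in the ground-truth answer.
    visual_dep: per-step visual dependence score (assumed precomputed).
    """
    g = np.asarray(conf_gain, dtype=float)
    is_perception = two_means_1d(visual_dep)
    adv = np.zeros_like(g)
    for mask in (is_perception, ~is_perception):
        if mask.any():
            part = g[mask]
            adv[mask] = (part - part.mean()) / (part.std() + eps)
    return adv, is_perception


# Toy data (made up for illustration): two perception steps with small
# gains, four reasoning steps with larger gains.
visual_dep = [0.90, 0.85, 0.10, 0.05, 0.12, 0.08]
conf_gain = [0.02, 0.04, 0.30, 0.50, 0.40, 0.60]

adv, is_perception = decomposed_advantage(conf_gain, visual_dep)
adv_global = global_advantage(conf_gain)
print("perception mask:", is_perception)
print("decomposed:", np.round(adv, 3))
print("global:    ", np.round(adv_global, 3))
```

On this toy data, global normalization assigns both perception steps a negative advantage simply because their gains are small relative to the reasoning steps, while intra-cluster normalization ranks the two perception steps against each other. That is the kind of distortion the abstract calls mixture-induced signal degradation, shown here only on invented numbers.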
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Multimodal Reasoning | MathVerse | Accuracy 55 | 259 |
| Multimodal Math Reasoning | MathVision | Accuracy 44.8 | 246 |
| Visual Mathematical Reasoning | MathVerse | Accuracy 70.6 | 155 |
| General Visual Understanding | RealworldQA | Accuracy 70.7 | 62 |
| General Visual Understanding | MMMU | Accuracy 57.1 | 35 |
| General Visual Understanding | MMMU-Pro | Accuracy 50.7 | 30 |
| General Visual Understanding | VisNumBench | Accuracy 41.1 | 30 |
| Hallucination Diagnosis | HallusionBench | -- | 15 |
| Visual Math & Hallucination | MathVision | Accuracy 51 | 5 |
| Visual Math & Hallucination | HallusionBench | Accuracy 76 | 5 |