PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

About

Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that a fine-grained, model-intrinsic signal, which rewards the growth in confidence in the ground-truth answer, improves language reasoning training by providing step-level guidance without costly external models. While this signal is effective for unimodal text, we find that naively applying it as a global reward to vision-language (V-L) reasoning is suboptimal, because the task is a heterogeneous mix of sparse visual perception and dense textual reasoning. Normalizing the reward globally across all steps creates mixture-induced signal degradation, where the training signal for visual steps is statistically distorted by the predominant textual steps. We propose Perception-Decomposed Confidence Reward (PDCR), a framework that addresses this by aligning the reward structure with the task's heterogeneous nature. PDCR first performs an unsupervised skill decomposition, introducing a model-internal Visual Dependence Score to quantify visual reliance and applying a clustering algorithm to separate perception and reasoning steps. Based on this decomposition, PDCR computes a decomposed advantage by normalizing confidence gains within each skill cluster. This intra-cluster normalization provides a stable, correctly scaled signal for both perception and reasoning. We demonstrate that PDCR outperforms the naive, global-reward formulation and sparse-reward baselines on key V-L reasoning benchmarks.
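
To make the difference between global and intra-cluster normalization concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. The abstract does not specify how the Visual Dependence Score is computed or which clustering algorithm is used, so the sketch takes per-step scores as given inputs and substitutes a simple 1-D two-means split as a stand-in. All names (`two_means_1d`, `zscore`, `decomposed_advantage`) and all numbers are hypothetical.

```python
from statistics import mean, pstdev


def two_means_1d(scores, iters=50):
    """Split 1-D scores into two clusters with a plain 2-means loop.

    Stand-in for the unspecified clustering step (an assumption).
    Returns labels: 1 for the high-score cluster, 0 for the low one.
    """
    lo, hi = min(scores), max(scores)
    labels = [0] * len(scores)
    for _ in range(iters):
        # Assign each score to the nearer of the two centroids.
        new_labels = [1 if abs(s - hi) < abs(s - lo) else 0 for s in scores]
        ones = [s for s, l in zip(scores, new_labels) if l == 1]
        zeros = [s for s, l in zip(scores, new_labels) if l == 0]
        if ones:
            hi = mean(ones)
        if zeros:
            lo = mean(zeros)
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def zscore(values, eps=1e-8):
    """Standardize values to zero mean and (roughly) unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def decomposed_advantage(gains, visual_scores):
    """Normalize confidence gains separately within each skill cluster."""
    labels = two_means_1d(visual_scores)
    adv = [0.0] * len(gains)
    for cluster in (0, 1):
        idx = [i for i, l in enumerate(labels) if l == cluster]
        if not idx:
            continue
        for i, a in zip(idx, zscore([gains[j] for j in idx])):
            adv[i] = a
    return adv, labels


# Made-up per-step data: two visually dependent steps among four textual ones.
gains = [0.30, 0.02, 0.01, 0.03, 0.25, 0.02]          # confidence gain per step
visual_scores = [0.90, 0.10, 0.05, 0.15, 0.80, 0.10]  # hypothetical dependence scores

global_adv = zscore(gains)  # one normalization over all steps
adv, labels = decomposed_advantage(gains, visual_scores)

for i, (g, d, l) in enumerate(zip(global_adv, adv, labels)):
    kind = "perception" if l == 1 else "reasoning"
    print(f"step {i} ({kind}): global={g:+.2f} decomposed={d:+.2f}")
```

In this toy data, the best reasoning step (index 3) receives a negative advantage under global normalization but a positive one within its own cluster, and the weaker of the two perception steps (index 4) flips from positive to negative. That is only an illustration of how mixing two differently scaled populations can change the sign of the signal. It makes no claim about the magnitude of the effect reported in the paper.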

Hee Suk Yoon, Eunseop Yoon, Ji Woo Hong, SooHwan Eom, Gwanhyeong Koo, Mark Hasegawa-Johnson, Qi Dai, Chong Luo, Chang D. Yoo • 2026

Related benchmarks

Task	Dataset	Result
Multimodal Math Reasoning	MathVision	Accuracy 44.8	263
Mathematical Multimodal Reasoning	MathVerse	Accuracy 55	259
Visual Mathematical Reasoning	MathVerse	Accuracy 70.6	194
Compositional Reasoning	SugarCrepe	Overall Accuracy 81.7	95
General Visual Understanding	RealworldQA	Accuracy 70.7	64
General Visual Understanding	MMMU	Accuracy 57.1	35
Compositional Reasoning	Winoground	--	33
General Visual Understanding	MMMU-Pro	Accuracy 50.7	30
General Visual Understanding	VisNumBench	Accuracy 41.1	30
Compositional Reasoning	BIVLC	Accuracy 86.5	16

Showing 10 of 13 rows
