Perception-Aware Policy Optimization for Multimodal Reasoning

About

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.
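The abstract describes the Implicit Perception Loss only as "a KL divergence term" and the Double Entropy Loss only as a regularizer for it. As a rough illustration of how such terms might be combined with a policy-gradient objective, here is a minimal sketch on a single toy token distribution. Several things in it are assumptions rather than details stated above: that the KL term compares the policy's output given the original image against its output given a corrupted or masked image, that this KL is added with a positive sign, that the entropy of both distributions is penalized, and the coefficient names `gamma` and `eta` together with their default values. The function `papo_style_objective` is hypothetical and not taken from the authors' code.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))


def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(pi * math.log(pi + eps) for pi in p)


def papo_style_objective(policy_term, logits_full, logits_masked, gamma=0.01, eta=0.03):
    """Hypothetical objective to maximize (illustrative assumptions only).

    policy_term   : scalar standing in for the usual GRPO/DAPO surrogate
    logits_full   : next-token logits given the original image (assumed)
    logits_masked : next-token logits given a masked image (assumed)
    gamma, eta    : made-up weights for the KL and entropy terms
    """
    p_full = softmax(logits_full)
    p_mask = softmax(logits_masked)
    kl = kl_divergence(p_full, p_mask)                   # perception-style KL term
    double_entropy = entropy(p_full) + entropy(p_mask)   # regularizer on both distributions
    return policy_term + gamma * kl - eta * double_entropy


if __name__ == "__main__":
    # The masked-image logits are flatter, so the KL term is positive.
    print(papo_style_objective(1.0, [2.0, 0.5, -1.0], [0.3, 0.2, 0.1]))
```

In an actual training setup these quantities would be computed per token over sampled responses and averaged; the scalar version here is only meant to show how the three terms relate.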

Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Multimodal Reasoning	MM-Vet	MM-Vet Score	57.93	517
Multimodal Understanding	MMStar	Accuracy	61.8	407
Mathematical Reasoning	MathVista	Accuracy	67.53	382
Visual Mathematical Reasoning	MathVista	Accuracy	76.7	366
Mathematical Multimodal Reasoning	MathVerse	Accuracy	69.01	259
Visual Mathematical Reasoning	MathVision	Accuracy	27.2	254
Multimodal Math Reasoning	MathVision	Accuracy	29.93	246
Visual Perception	BLINK	Accuracy	52.66	241
Mathematical Reasoning	WeMath	Accuracy	71.1	225
Multimodal Math Reasoning	WeMath	Accuracy	43.52	211

Showing 10 of 120 rows
