
Perception-Aware Policy Optimization for Multimodal Reasoning

About

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.

Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji • 2025
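The abstract specifies only that the Implicit Perception Loss is a KL divergence term that plugs into RLVR objectives such as GRPO. The sketch below is one plausible reading, not the paper's implementation: it assumes the KL is taken between the policy's per-token distributions given the intact image versus a degraded (e.g. masked) image, and that maximizing this KL (i.e. subtracting it from the loss, scaled by a small coefficient) pushes the model to actually rely on the visual input. The names `implicit_perception_loss`, `papo_objective`, and the coefficient `gamma` are all hypothetical.

```python
import torch
import torch.nn.functional as F

def implicit_perception_loss(logits_full: torch.Tensor,
                             logits_masked: torch.Tensor) -> torch.Tensor:
    """Per-token KL( p(o | q, I) || p(o | q, I_masked) ), averaged.

    logits_full / logits_masked: [batch, seq_len, vocab] policy logits
    computed with the original and the degraded image, respectively.
    """
    logp_full = F.log_softmax(logits_full, dim=-1)
    logp_masked = F.log_softmax(logits_masked, dim=-1)
    # KL(p || q) = sum_v p(v) * (log p(v) - log q(v)), summed over vocab
    kl = (logp_full.exp() * (logp_full - logp_masked)).sum(dim=-1)
    return kl.mean()

def papo_objective(grpo_loss: torch.Tensor,
                   logits_full: torch.Tensor,
                   logits_masked: torch.Tensor,
                   gamma: float = 0.01) -> torch.Tensor:
    """Sketch of a PAPO-style loss: the base GRPO loss minus a small
    multiple of the perception KL, so that minimizing the total loss
    *maximizes* the divergence between image-conditioned and
    image-degraded predictions (assumed reading of the abstract)."""
    return grpo_loss - gamma * implicit_perception_loss(logits_full, logits_masked)
```

Under this reading, the Double Entropy Loss mentioned in the abstract would act as an additional regularizer on this KL term to keep training stable; its exact form is not given here, so it is omitted from the sketch.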

Related benchmarks

Task                               Dataset     Metric     Result   Rank
Multimodal Understanding           MMStar      Accuracy   61.8     324
Visual Mathematical Reasoning      MathVista   Accuracy   76.7     278
Mathematical Reasoning             MathVista   Accuracy   67.53    257
Mathematical Multimodal Reasoning  MathVerse   Accuracy   69.01    221
Visual Mathematical Reasoning      MathVision  Accuracy   27.2     186
Multimodal Math Reasoning          MathVision  Accuracy   28.13    183
Multimodal Math Reasoning          WeMath      --         --       168
Mathematical Reasoning             WeMath      Accuracy   71.1     161
Mathematical Reasoning             MathVision  --         --       144
Multimodal Reasoning               MMStar      Accuracy   66.93    143

Showing 10 of 87 rows
