
Perception-Aware Policy Optimization for Multimodal Reasoning

About

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.
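The abstract describes the Implicit Perception Loss only as "a KL divergence term" and does not say which two distributions it compares. As a rough illustration, the sketch below assumes the KL is taken between the policy's next-token distribution given the original image and the same policy's distribution given a corrupted (for example, masked) image, and that the Double Entropy Loss penalizes the entropy of both distributions. The function names, the weights `gamma` and `eta`, and the sign convention (an objective to be maximized) are all hypothetical choices for this example, not the authors' implementation.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(pi * math.log(pi + eps) for pi in p if pi > 0)

def perception_terms(logits_full, logits_masked):
    """Average per-token KL and entropies over one response.

    logits_full[t]   : next-token logits given the original image (assumed input)
    logits_masked[t] : next-token logits given a corrupted image (assumed input)
    """
    kls, h_full, h_masked = [], [], []
    for lf, lm in zip(logits_full, logits_masked):
        p, q = softmax(lf), softmax(lm)
        kls.append(kl_divergence(p, q))
        h_full.append(entropy(p))
        h_masked.append(entropy(q))
    n = len(kls)
    return sum(kls) / n, sum(h_full) / n, sum(h_masked) / n

def papo_style_objective(rl_objective, logits_full, logits_masked, gamma=0.01, eta=0.03):
    """Toy objective to maximize: base RLVR objective (e.g. a GRPO surrogate value)
    plus a KL bonus, minus an entropy penalty on both distributions.
    gamma and eta are illustrative placeholders, not values from the paper."""
    kl, h_full, h_masked = perception_terms(logits_full, logits_masked)
    return rl_objective + gamma * kl - eta * (h_full + h_masked)
```

Under these assumptions, a response whose token distributions barely change when the image is corrupted earns almost no KL bonus, while a response that actually depends on the visual input earns a larger one. The entropy penalty is there to stop the KL term from being inflated by pushing either distribution toward noise.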

Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji • 2025

Related benchmarks

Task                              | Dataset    | Result             | Rank
Multimodal Reasoning              | MM-Vet     | MM-Vet Score 57.93 | 517
Multimodal Understanding          | MMStar     | Accuracy 61.8      | 407
Mathematical Reasoning            | MathVista  | Accuracy 67.53     | 382
Visual Mathematical Reasoning     | MathVista  | Accuracy 76.7      | 366
Mathematical Multimodal Reasoning | MathVerse  | Accuracy 69.01     | 259
Visual Mathematical Reasoning     | MathVision | Accuracy 27.2      | 254
Multimodal Math Reasoning         | MathVision | Accuracy 29.93     | 246
Visual Perception                 | BLINK      | Accuracy 52.66     | 241
Mathematical Reasoning            | WeMath     | Accuracy 71.1      | 225
Multimodal Math Reasoning         | WeMath     | Accuracy 43.52     | 211
Showing 10 of 120 rows
