
Perception-Aware Policy Optimization for Multimodal Reasoning

About

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.
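The abstract says only that the Implicit Perception Loss is a KL divergence term and that the Double Entropy Loss regularizes it. The following is a minimal sketch of how such terms could be combined with an RLVR objective, not the authors' implementation. It assumes that the KL is taken between the policy's next-token distributions given the original image and given a corrupted (for example, masked) copy, that this KL is rewarded, and that the entropy of both distributions is penalized. The function names, the weights `gamma` and `eta`, and their default values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution."""
    return -sum(pi * math.log(pi + eps) for pi in p)

def perception_aware_objective(task_objective, logits_full, logits_masked,
                               gamma=0.01, eta=0.03):
    """Hypothetical objective to MAXIMIZE (assumed form, see lead-in).

    task_objective: scalar from the base RLVR algorithm (e.g. GRPO or DAPO).
    logits_full:    per-token logits when the model sees the original image.
    logits_masked:  per-token logits when the model sees a corrupted image.
    gamma, eta:     illustrative weights, not values from the paper.
    """
    kls, ents = [], []
    for lf, lm in zip(logits_full, logits_masked):
        p_full, p_masked = softmax(lf), softmax(lm)
        # Assumed perception term: outputs should depend on the visual input.
        kls.append(kl_divergence(p_full, p_masked))
        # Assumed regularizer: penalize entropy under both conditions.
        ents.append(entropy(p_full) + entropy(p_masked))
    mean_kl = sum(kls) / len(kls)
    mean_ent = sum(ents) / len(ents)
    objective = task_objective + gamma * mean_kl - eta * mean_ent
    return objective, mean_kl, mean_ent

# Usage: one token position, vocabulary of three tokens.
obj, kl, ent = perception_aware_objective(
    task_objective=1.0,
    logits_full=[[2.0, 0.0, -1.0]],
    logits_masked=[[0.0, 0.0, 0.0]],
)
```

Under these assumptions, the KL term is zero when the model's output ignores the image entirely, so rewarding it pushes the policy toward answers that actually depend on what it sees; the entropy penalty is one plausible way to keep that term from being inflated by simply flattening or sharpening the distributions.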

Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji • 2025

Related benchmarks

Task                       Dataset             Result           Rank
Hallucination Evaluation   HallusionBench      --               93
Mathematical Reasoning     WeMath              Accuracy 39.5    75
Visual Reasoning           BLINK               Accuracy 52.66   50
Mathematical Reasoning     MathVerse           Accuracy 44.5    39
General Visual Reasoning   MMStar              Accuracy 45.8    29
Visual Logical Reasoning   LogicVista          Accuracy 45.8    28
Visual Reasoning           MMVP                Accuracy 68.67   19
Mathematical Reasoning     MathVista MVistam   Accuracy 71.6    18
Mathematical Reasoning     DynaMath DMath      Accuracy 54.7    18
Visual Reasoning           MMMU Pro Vision     Accuracy 38.7    18
Showing 10 of 17 rows
