Visually-Guided Policy Optimization for Multimodal Reasoning

About

Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherently text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation on visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework that reinforces visual focus during policy optimization. Specifically, VGPO first introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively raising visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance on mathematical multimodal reasoning and visual-dependent tasks. The code has been released at https://github.com/wzb-bupt/VGPO.
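
The dual-grained re-weighting described above can be illustrated with a small sketch. This is not the released VGPO implementation: the function names, the linear weighting form, the `alpha` and `beta` strengths, and the use of a GRPO-style group-normalized advantage as the base signal are all assumptions made for illustration. Per-token visual activation scores are assumed to be given in [0, 1].

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Standardize each trajectory's reward within its sampled group
    (a GRPO-style baseline, assumed here as the starting advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]


def reweight_advantages(rewards, visual_scores, alpha=0.5, beta=0.5):
    """Return per-token advantages scaled by two hypothetical visual weights.

    rewards:       one verifiable reward per trajectory.
    visual_scores: per trajectory, a list of per-token visual activation
                   scores in [0, 1] (assumed to be precomputed).
    alpha, beta:   illustrative strengths of the token-level and
                   trajectory-level weights (assumptions, not from the paper).
    """
    base = group_advantages(rewards)
    traj_visual = [mean(scores) for scores in visual_scores]
    group_visual = mean(traj_visual)

    weighted = []
    for adv, scores, traj_mean in zip(base, visual_scores, traj_visual):
        # Inter-trajectory level: larger weight for trajectories whose
        # average visual activation exceeds the group average.
        traj_w = 1.0 + beta * (traj_mean - group_visual)
        # Intra-trajectory level: larger weight for tokens whose activation
        # exceeds the average of their own trajectory.
        weighted.append(
            [adv * traj_w * (1.0 + alpha * (s - traj_mean)) for s in scores]
        )
    return weighted


# Two sampled trajectories: one correct (reward 1.0), one incorrect (0.0).
example = reweight_advantages([1.0, 0.0], [[0.2, 0.6], [0.1, 0.1]])
```

In this toy setup the second token of the first trajectory receives the largest advantage, because it combines a correct answer, an above-average trajectory, and above-average token-level activation. How the weights should treat negative advantages is a design choice this sketch does not settle.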

Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, Xiangxiang Chu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | WeMath | Accuracy | 72.5 | 225 |
| Mathematical Reasoning | MathVista | Accuracy (%) | 74.1 | 29 |
| Mathematical Reasoning | MathVerse-V | Accuracy | 67.6 | 28 |
| Multimodal Reasoning | MMK12 | Accuracy | 81.5 | 23 |
| Logic Reasoning | LogicVista | LogicVista Accuracy | 49.4 | 16 |
| Geometric Reasoning | Geo3K | Accuracy | 45.8 | 13 |
| Mathematical & Geometric Reasoning | General Mathematical & Geometric Reasoning Suite (MathVista, MathVerse, WeMath, MMK12, GeoMath, Geo3k) | MathVista Score | 74.1 | 12 |
| Vision-dependent Multimodal Reasoning | Vision-dependent Multimodal Reasoning Suite (LogicVista, Counting, MMMU-Pro, MathVerseV) | LogicVista Score | 49.4 | 12 |
