Spotlight on Token Perception for Multimodal Reinforcement Learning

About

While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.

Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, Yu Cheng • 2025
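
The dual mechanism described in the abstract (reweighting a trajectory's advantage by its overall visual dependency, and restricting updates to perceptually pivotal tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that per-token perception scores are already available as non-negative numbers, that a trajectory's overall dependency is the mean of its token scores, that weights are normalised by the group mean, and that "pivotal" means the top fraction of tokens by score. The function names and the `top_frac` parameter are hypothetical.

```python
import numpy as np


def trajectory_weights(token_scores, eps=1e-8):
    """Weight each trajectory by its overall visual dependency.

    Assumption: overall dependency is the mean token score, and weights
    are divided by the group mean so that they average to roughly 1.
    """
    deps = np.array([float(np.mean(s)) for s in token_scores])
    return deps / (deps.mean() + eps)


def pivotal_token_mask(scores, top_frac=0.4):
    """Boolean mask selecting the highest-scoring tokens of one trajectory.

    Assumption: 'pivotal' tokens are the top `top_frac` fraction by score.
    """
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(np.ceil(top_frac * scores.size)))
    top_idx = np.argsort(scores)[-k:]  # indices of the k largest scores
    mask = np.zeros(scores.size, dtype=bool)
    mask[top_idx] = True
    return mask


def shaped_token_advantages(advantages, token_scores, top_frac=0.4):
    """Combine both steps into per-token advantages.

    Each trajectory-level advantage is scaled by its trajectory weight,
    then zeroed on every token outside the pivotal mask, so only pivotal
    tokens would contribute to a policy-gradient update.
    """
    weights = trajectory_weights(token_scores)
    shaped = []
    for adv, w, scores in zip(advantages, weights, token_scores):
        mask = pivotal_token_mask(scores, top_frac)
        shaped.append(adv * w * mask.astype(float))
    return shaped


if __name__ == "__main__":
    # Two toy rollouts with made-up perception scores.
    advantages = [1.0, -1.0]
    token_scores = [
        np.array([0.9, 0.1, 0.0, 0.8, 0.2]),
        np.array([0.1, 0.0, 0.3, 0.0]),
    ]
    for row in shaped_token_advantages(advantages, token_scores):
        print(np.round(row, 3))
```

In this toy setup the first rollout, which has the higher average score, receives a larger weight than the second, and tokens with low scores receive zero advantage. How the paper actually computes token perception, the trajectory weight, and the pivotal-token threshold is not stated on this page.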

Related benchmarks

Task | Dataset | Result | Rank
Text-based Visual Question Answering | TextVQA | Accuracy 86.2 | 807
Visual Mathematical Reasoning | MathVista | Accuracy 76.6 | 278
Mathematical Multimodal Reasoning | MathVerse | Accuracy 70.95 | 221
Visual Mathematical Reasoning | MathVision | Accuracy 30.52 | 186
Multimodal Math Reasoning | MathVision | Accuracy 28.27 | 183
Mathematical Reasoning | MathVision | -- | 144
Multimodal Reasoning | MMStar | Accuracy 67.2 | 143
Visual Mathematical Reasoning | MathVerse | Accuracy 47.2 | 135
Visual Mathematical Reasoning | WeMath | Accuracy 43.81 | 127
Multimodal Reasoning | MMMU-Pro | Accuracy 39.65 | 107
Showing 10 of 42 rows
