
Unlocking Multimodal Mathematical Reasoning via Process Reward Model

About

Process Reward Models (PRMs) have shown promise in enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) through Test-Time Scaling (TTS). However, their integration into multimodal reasoning remains largely unexplored. In this work, we take the first step toward unlocking the potential of PRMs in multimodal mathematical reasoning. We identify three key challenges: (1) the scarcity of high-quality reasoning data constrains the capabilities of foundation Multimodal Large Language Models (MLLMs), which in turn limits the upper bounds of TTS and reinforcement learning (RL); (2) automated methods for process labeling in multimodal contexts are still lacking; (3) the use of process rewards in unimodal RL suffers from issues such as reward hacking, which may extend to multimodal scenarios. To address these issues, we introduce URSA, a three-stage Unfolding multimodal Process-Supervision Aided training framework. We first construct MMathCoT-1M, a high-quality large-scale multimodal Chain-of-Thought (CoT) reasoning dataset, to build a stronger math reasoning foundation MLLM, URSA-8B. Next, we use an automated pipeline to synthesize process supervision data that emphasizes both logical correctness and perceptual consistency. We introduce DualMath-1.1M to facilitate the training of URSA-8B-RM. Finally, we propose Process-Supervised Group-Relative-Policy-Optimization (PS-GRPO), a pioneering multimodal PRM-aided online RL method that outperforms vanilla GRPO. With PS-GRPO applied, URSA-8B-PS-GRPO outperforms Gemma3-12B and GPT-4o by 8.4% and 2.7% on average across 6 benchmarks. Code, data, and checkpoints are available at https://github.com/URSA-MATH.
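
To make the Test-Time Scaling idea concrete, here is a minimal sketch of best-of-N selection guided by a process reward model. It is an illustration only and not the URSA implementation: the names `aggregate_step_scores`, `best_of_n`, and `toy_prm` are hypothetical, the choice of taking the minimum step score is one common aggregation among several, and the toy scorer is a stand-in for a trained PRM such as URSA-8B-RM.

```python
from typing import Callable, List, Sequence


def aggregate_step_scores(step_scores: Sequence[float]) -> float:
    """Collapse per-step scores into one solution score.

    Assumption: a solution is only as good as its weakest step, so we take
    the minimum. Other aggregations (product, last step) are also used.
    """
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    return min(step_scores)


def best_of_n(
    candidates: List[List[str]],
    score_steps: Callable[[List[str]], Sequence[float]],
) -> int:
    """Return the index of the candidate with the highest aggregated score.

    `candidates` holds N sampled solutions, each a list of reasoning steps.
    `score_steps` stands in for a PRM and returns one score per step.
    """
    if not candidates:
        raise ValueError("need at least one candidate solution")
    scores = [aggregate_step_scores(score_steps(c)) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def toy_prm(steps: List[str]) -> List[float]:
    """Hypothetical scorer: steps ending in doubt ('?') get a low score."""
    return [0.2 if "?" in step else 0.9 for step in steps]


candidates = [
    ["x = 2", "so y = 5?"],
    ["x = 2", "so y = 4"],
    ["y = 4?"],
]
print(best_of_n(candidates, toy_prm))  # prints 1
```

In practice the scorer would be a model call that also conditions on the image and the question, and N would be the sampling budget that TTS trades for accuracy.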

Ruilin Luo, Zhuofan Zheng, Yifan Wang, Xinzhe Ni, Zicheng Lin, Songtao Jiang, Yiyao Yu, Chufan Shi, Lei Wang, Ruihang Chu, Jin Zeng, Yujiu Yang • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Multimodal Understanding | MMBench | Accuracy | 55.5 | 637
Multimodal Understanding | MMMU | Accuracy | 43.1 | 437
Chart Question Answering | ChartQA | Accuracy | 44.4 | 356
Mathematical Multimodal Reasoning | MathVerse | Accuracy | 45.7 | 221
Multimodal Math Reasoning | MathVision | Accuracy | 26.2 | 183
Multimodal Perception and Cognition | MME | Overall Score | 1.61e+3 | 182
Mathematical Reasoning | MathVision | Accuracy | 26.2 | 144
Multimodal Reasoning | MMStar | Accuracy | 55.4 | 143
Mathematical Reasoning | MathVerse | -- | -- | 109
Multimodal Mathematical Reasoning | MathVista mini | Accuracy | 0.598 | 90
Showing 10 of 17 rows
