Unlocking Multimodal Mathematical Reasoning via Process Reward Model

About

Process Reward Models (PRMs) have shown promise in enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) through Test-Time Scaling (TTS). However, their integration into multimodal reasoning remains largely unexplored. In this work, we take the first step toward unlocking the potential of PRMs in multimodal mathematical reasoning. We identify three key challenges: (1) the scarcity of high-quality reasoning data constrains the capabilities of foundation Multimodal Large Language Models (MLLMs), which in turn limits the upper bounds of TTS and reinforcement learning (RL); (2) automated methods for process labeling in multimodal contexts are still lacking; (3) the use of process rewards in unimodal RL suffers from issues such as reward hacking, which may extend to multimodal scenarios. To address these issues, we introduce URSA, a three-stage Unfolding multimodal Process-Supervision Aided training framework. We first construct MMathCoT-1M, a high-quality large-scale multimodal Chain-of-Thought (CoT) reasoning dataset, to build a stronger math reasoning foundation MLLM, URSA-8B. Subsequently, we use an automatic pipeline to synthesize process supervision data that emphasizes both logical correctness and perceptual consistency. We introduce DualMath-1.1M to facilitate the training of URSA-8B-RM. Finally, we propose Process-Supervised Group-Relative-Policy-Optimization (PS-GRPO), a pioneering multimodal PRM-aided online RL method that outperforms vanilla GRPO. Trained with PS-GRPO, URSA-8B-PS-GRPO outperforms Gemma3-12B and GPT-4o by 8.4% and 2.7%, respectively, on average across 6 benchmarks. Code, data, and checkpoints are available at https://github.com/URSA-MATH.
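
To make two of the ideas named in the abstract concrete, here is a minimal sketch, not the authors' implementation. It illustrates (a) PRM-guided best-of-N selection as one common form of test-time scaling and (b) the group-relative advantage used by vanilla GRPO. The function names, the choice of `min` to aggregate step scores, and the use of the population standard deviation are assumptions made for illustration. The abstract does not specify how PS-GRPO incorporates process rewards, so that part is not shown.

```python
from statistics import mean, pstdev


def best_of_n(candidates, step_scores, aggregate=min):
    """Return the candidate whose aggregated per-step PRM score is highest.

    candidates:  list of N sampled solutions (any objects).
    step_scores: list of N lists, one PRM score per reasoning step.
    aggregate:   how step scores collapse into one number; `min` is an
                 assumption here (a solution is as weak as its weakest step).
    """
    totals = [aggregate(scores) for scores in step_scores]
    best_index = max(range(len(candidates)), key=lambda i: totals[i])
    return candidates[best_index]


def group_relative_advantages(rewards, eps=1e-8):
    """Vanilla GRPO advantage: standardize each reward within its group.

    rewards: outcome rewards for the rollouts sampled from one prompt.
    eps:     small constant to avoid division by zero when all rewards match.
    Population standard deviation is an assumption; implementations vary.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: two sampled solutions with per-step PRM scores.
chosen = best_of_n(["solution A", "solution B"], [[0.9, 0.2], [0.6, 0.7]])
# chosen is "solution B": its weakest step (0.6) beats A's weakest step (0.2).

# Hypothetical example: four rollouts, two correct (reward 1) and two wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# advantages is approximately [1, -1, -1, 1].
```

Aggregating with `min` penalizes any single bad step, whereas a mean or product would trade off steps against each other; which choice works best for a given PRM is an empirical question.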

Ruilin Luo, Zhuofan Zheng, Yifan Wang, Xinzhe Ni, Zicheng Lin, Songtao Jiang, Yiyao Yu, Chufan Shi, Lei Wang, Ruihang Chu, Jin Zeng, Yujiu Yang • 2025

Related benchmarks

Task	Dataset	Result
Multimodal Understanding	MMBench	Accuracy 55.5	887
Multimodal Understanding	MMMU	Accuracy 43.1	437
Chart Question Answering	ChartQA	Accuracy 44.4	404
Multimodal Perception and Cognition	MME	Overall Score 1.61e+3	344
Mathematical Reasoning	MathVerse	--	266
Multimodal Math Reasoning	MathVision	Accuracy 26.2	263
Mathematical Multimodal Reasoning	MathVerse	Accuracy 45.7	259
Multimodal Evaluation	MMStar	Accuracy 42.3	177
Mathematical Reasoning	MathVision	Accuracy 26.2	168
Multimodal Reasoning	MMStar	Accuracy 55.4	143

Showing 10 of 17 rows
