VRPRM: Process Reward Modeling via Visual Reasoning

About

Process Reward Model (PRM) is widely used in the post-training of Large Language Model (LLM) because it can perform fine-grained evaluation of the reasoning steps of generated content. However, most PRMs lack long-term reasoning and deep thinking capabilities. On the other hand, although a few works have tried to introduce Chain-of-Thought (CoT) capability into PRMs, the annotation cost of CoT-PRM data is too expensive to play a stable role in various tasks. To address the above challenges, we propose VRPRM, a process reward model via visual reasoning, and design an efficient two-stage training strategy. Experimental results show that using only 3.6K CoT-PRM Supervised Fine-Tuning(SFT) data and 50K non-CoT PRM Reinforcement Learning (RL) training data, VRPRM can surpass the non-thinking PRM with a total data volume of 400K and achieved a relative performance improvement of up to 118\% over the base model in the BoN experiment. This result confirms that the proposed combined training strategy can achieve higher quality reasoning capabilities at a lower data annotation cost, thus providing a new paradigm for PRM training with more efficient data utilization.

Xinquan Chen, Chongying Yue, Bangwei Liu, Xuhong Wang, Yingchun Wang, Chaochao Lu• 2025

Related benchmarks

Task	Dataset	Result
Multimodal Reasoning	WeMath	Accuracy51.43	199
Multimodal Reasoning	LogicVista	Accuracy84.78	172
Multimodal Reasoning	MathVision	Accuracy59.41	162
Multimodal Reasoning	DynaMath	--	77
Multimodal Reasoning	MathVista	Accuracy83.5	50
Multimodal Reasoning	HLE	Accuracy14.04	33
Multimodal Reasoning	MMMU	FEI Score55.06	20
Multimodal Reasoning	MathVision	FEI46.07	20
Multimodal Reasoning	VisualProcessBench Overall	FEI Average54.6	20
Multimodal Reasoning	MathVerse VO	FEI Score48.46	20

Showing 10 of 11 rows

Other info

Follow for update

@wizwand_team Discord