VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data

About

Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data, and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs perform poorly in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM, via weighted majority voting, achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code, and models for VersaPRM.
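The abstract compares weighted majority voting against a plain majority voting baseline. The sketch below illustrates the general idea only and is not taken from the VersaPRM code: each sampled solution contributes its PRM-derived score to its final answer, and the answer with the largest total wins. The function names are hypothetical, and collapsing per-step PRM scores into one solution score with `min` is an assumption (one common choice), not something this page confirms.

```python
from collections import Counter, defaultdict


def solution_score(step_scores):
    """Collapse per-step PRM scores into one score (assumed: minimum over steps)."""
    return min(step_scores)


def majority_vote(answers):
    """Baseline: return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]


def weighted_majority_vote(answers, scores):
    """Return the answer whose candidate solutions have the largest total score."""
    if len(answers) != len(scores):
        raise ValueError("answers and scores must have the same length")
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


# Toy data (made up for illustration): five sampled solutions to one question.
answers = ["B", "B", "B", "A", "A"]
step_scores = [[0.9, 0.2], [0.5, 0.1], [0.3, 0.6], [0.9, 0.95], [0.8, 0.85]]
scores = [solution_score(s) for s in step_scores]

baseline_choice = majority_vote(answers)
weighted_choice = weighted_majority_vote(answers, scores)
```

In this toy case the baseline picks the more frequent answer, while the weighted vote can prefer a less frequent answer whose solutions received higher scores.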

Thomas Zeng, Shuibai Zhang, Shutong Wu, Christian Classen, Daewon Chae, Ethan Ewer, Minjae Lee, Heeju Kim, Wonjun Kang, Jackson Kunde, Ying Fan, Jungtaek Kim, Hyung Il Koo, Kannan Ramchandran, Dimitris Papailiopoulos, Kangwook Lee • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	AIME24	Accuracy 63.75	48
Mathematical Reasoning	AIME25	Accuracy 46.77	48
Mathematical Reasoning	AIME 2025	Weighted Majority Voting Accuracy 66.72	27
Mathematical Reasoning	AIME 2024	Weighted Accuracy 75.52	27
Differential Expression	PerturbQA (HepG2)	Accuracy 46.86	6
Differential Expression	PerturbQA Jurkat	Accuracy 47.84	6
Differential Expression	PerturbQA K562	Accuracy 43.64	6
Differential Expression	PerturbQA RPE1	Accuracy 43.73	6
Degree of Change	PerturbQA (HepG2)	Accuracy 69.7	6
Degree of Change	PerturbQA Jurkat	Accuracy 57.61	6

Showing 10 of 12 rows
