
VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data

About

Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data, and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs perform poorly in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM with weighted majority voting achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code, and models for VersaPRM.
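
The difference between the majority voting baseline and weighted majority voting can be illustrated with a minimal sketch. It assumes each sampled solution has already been reduced to a final answer and a single scalar PRM score; how step-level scores are aggregated into that scalar is not described here, and the function names and example values below are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most frequent final answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def weighted_majority_vote(answers, scores):
    """Return the answer whose candidate solutions have the largest summed score.

    `scores` holds one scalar per candidate (assumed to come from a PRM).
    """
    if len(answers) != len(scores):
        raise ValueError("answers and scores must have the same length")
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


# Hypothetical example: "B" is more frequent, but "A" is scored much higher.
answers = ["A", "A", "B", "B", "B"]
scores = [0.9, 0.8, 0.2, 0.3, 0.1]
print(majority_vote(answers))                   # B
print(weighted_majority_vote(answers, scores))  # A
```

In this toy case the two rules disagree: plain voting counts three low-scored "B" solutions, while the weighted rule favors the two "A" solutions that the scorer rates highly.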

Thomas Zeng, Shuibai Zhang, Shutong Wu, Christian Classen, Daewon Chae, Ethan Ewer, Minjae Lee, Heeju Kim, Wonjun Kang, Jackson Kunde, Ying Fan, Jungtaek Kim, Hyung Il Koo, Kannan Ramchandran, Dimitris Papailiopoulos, Kangwook Lee • 2025

Related benchmarks

Task                     Dataset            Result                                   Rank
Mathematical Reasoning   AIME24             Accuracy 63.75                           48
Mathematical Reasoning   AIME25             Accuracy 46.77                           48
Mathematical Reasoning   AIME 2025          Weighted Majority Voting Accuracy 66.72  27
Mathematical Reasoning   AIME 2024          Weighted Accuracy 75.52                  27
Differential Expression  PerturbQA (HepG2)  Accuracy 46.86                           6
Differential Expression  PerturbQA Jurkat   Accuracy 47.84                           6
Differential Expression  PerturbQA K562     Accuracy 43.64                           6
Differential Expression  PerturbQA RPE1     Accuracy 43.73                           6
Degree of Change         PerturbQA (HepG2)  Accuracy 69.7                            6
Degree of Change         PerturbQA Jurkat   Accuracy 57.61                           6
Showing 10 of 12 rows
