AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification

About

The reasoning capabilities of advanced large language models (LLMs) like o1 have revolutionized artificial intelligence applications. Nevertheless, evaluating and optimizing complex reasoning processes remain significant challenges due to diverse policy distributions and the inherent limitations of human effort and accuracy. In this paper, we present AURORA, a novel automated framework for training universal process reward models (PRMs) using ensemble prompting and reverse verification. The framework employs a two-phase approach. First, it uses diverse prompting strategies and ensemble methods to perform automated annotation and evaluation of processes, ensuring robust assessments for reward learning. Second, it leverages practical reference answers for reverse verification, enhancing the model's ability to validate outputs and improving training accuracy. To assess the framework's performance, we extend beyond the existing ProcessBench benchmark by introducing UniversalBench, which evaluates reward predictions across full trajectories under diverse policy distributions with long Chain-of-Thought (CoT) outputs. Experimental results demonstrate that AURORA enhances process evaluation accuracy and improves PRMs' accuracy on diverse policy distributions and long-CoT responses. The project will be open-sourced at https://auroraprm.github.io/. The Universal-PRM-7B model is available at https://huggingface.co/infly/Universal-PRM-7B.
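The two phases described above can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not say how the ensemble is aggregated, so majority voting is an assumption here, and every name (`ensemble_step_labels`, `arithmetic_judge`, `reference_judge`, `propagation_judge`, `first_error_index`) is hypothetical. The rule-based judges are stand-ins for differently prompted LLM annotators, and the reference answer is passed to the judges to mimic reverse verification.

```python
import re
from typing import Callable, List, Optional

# Hypothetical judge type: (steps, reference_answer) -> one bool per step.
Judge = Callable[[List[str], Optional[str]], List[bool]]


def arithmetic_judge(steps: List[str], reference_answer: Optional[str] = None) -> List[bool]:
    """Toy judge: a step shaped like 'a + b = c' is correct iff the sum holds."""
    labels = []
    for step in steps:
        m = re.fullmatch(r"\s*(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)\s*", step)
        labels.append(True if m is None else int(m[1]) + int(m[2]) == int(m[3]))
    return labels


def reference_judge(steps: List[str], reference_answer: Optional[str] = None) -> List[bool]:
    """Toy reverse verification: the final step must contain the reference answer."""
    labels = [True] * len(steps)
    if reference_answer is not None and steps:
        labels[-1] = reference_answer in steps[-1].split()
    return labels


def propagation_judge(steps: List[str], reference_answer: Optional[str] = None) -> List[bool]:
    """Toy judge: every step after the first arithmetic error is also marked incorrect."""
    labels, broken = [], False
    for ok in arithmetic_judge(steps):
        broken = broken or not ok
        labels.append(not broken)
    return labels


def ensemble_step_labels(steps: List[str], judges: List[Judge],
                         reference_answer: Optional[str] = None) -> List[bool]:
    """Assumed aggregation: strict majority vote per step (ties count as incorrect)."""
    all_votes = [judge(steps, reference_answer) for judge in judges]
    return [2 * sum(votes[i] for votes in all_votes) > len(judges)
            for i in range(len(steps))]


def first_error_index(labels: List[bool]) -> int:
    """Index of the first step labeled incorrect, or -1 if every step is correct."""
    for i, ok in enumerate(labels):
        if not ok:
            return i
    return -1


steps = ["2 + 3 = 5", "5 + 4 = 10", "The answer is 10"]
judges = [arithmetic_judge, reference_judge, propagation_judge]

labels_without_ref = ensemble_step_labels(steps, judges)
labels_with_ref = ensemble_step_labels(steps, judges, reference_answer="9")
print(labels_without_ref, labels_with_ref, first_error_index(labels_with_ref))
```

In this toy run the reference answer changes only the label of the final step: without it, two of the three judges accept the conclusion, while with it the majority rejects it. Step-level labels of this kind are what a process reward model would be trained on.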

Xiaoyu Tan, Tianchu Yao, Chao Qu, Bin Li, Minghao Yang, Dakuan Lu, Haozhe Wang, Xihe Qiu, Wei Chu, Yinghui Xu, Yuan Qi • 2025

Related benchmarks

Task	Dataset	Result
Science Question Answering	ScienceQA	Accuracy 96.8	916
Mathematical Reasoning	MATH 500	Accuracy (Acc) 75.2	600
Mathematical Reasoning	AIME 2024	Accuracy 30	525
Mathematical Reasoning	GSM8K	Accuracy 100	388
Reasoning	MATH	--	46
Mathematical Reasoning	Math ID GSM8k ProofNet	GSM8k Accuracy 97.5	28
Question Answering	QA OOD StrQA SciQA	StrQA Accuracy 87.8	28
Reasoning Question Answering	StrategyQA	Accuracy 0.925	26
Step-level correctness assessment	ProofNet (test)	PR-AUC 32.9	22
Step-level correctness assessment	MATH (test)	PR-AUC 53.4	22

Showing 10 of 30 rows
