Adversarial Training for Process Reward Models
About
Process Reward Models (PRMs) enhance the reasoning ability of LLMs by providing step-level supervision. However, their widespread adoption is limited by expensive manual step-level annotation and the poor generalization of static training data to novel errors. We introduce Adversarially Trained PRMs (\texttt{APRM}), in which a Generator ($G$) learns to produce reasoning errors that deceive a PRM ($R$), while $R$ concurrently learns to detect them. This interaction yields progressively harder negatives for $R$, improving its robustness and generalization to novel errors without requiring manual step-level labels. Averaged across diverse mathematical reasoning benchmarks, \texttt{APRM} improves solver accuracy by $+3.4$ percentage points (pp) over the strongest PRM baseline, with gains of $+5.3$ pp on out-of-distribution tasks.
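The dynamic described above maps naturally onto a two-player training loop. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the `Generator`/`PRM` interfaces, `corrupt_step`, and the reward/label scheme are placeholders introduced here for clarity.

```python
from typing import List, Protocol
import random

class Generator(Protocol):
    """Hypothetical interface for the error generator G."""
    def corrupt_step(self, steps: List[str], idx: int) -> str: ...
    def update(self, reward: float) -> None: ...

class PRM(Protocol):
    """Hypothetical interface for the process reward model R."""
    def score_steps(self, steps: List[str]) -> List[float]: ...
    def update(self, steps: List[str], labels: List[float]) -> None: ...

def adversarial_round(generator: Generator, prm: PRM,
                      solutions: List[List[str]]) -> None:
    """One adversarial round: G plants a step-level error, R learns to flag it."""
    for steps in solutions:
        # G picks a step and rewrites it to contain a subtle error.
        idx = random.randrange(len(steps))
        perturbed = steps[:idx] + [generator.corrupt_step(steps, idx)] + steps[idx + 1:]

        # R scores each step in [0, 1]; the planted error should score low.
        scores = prm.score_steps(perturbed)

        # G is rewarded when R fails to flag the corrupted step ...
        generator.update(reward=scores[idx])

        # ... while R trains on it as a hard negative, with the
        # untouched steps kept as positives.
        labels = [1.0] * len(perturbed)
        labels[idx] = 0.0
        prm.update(perturbed, labels)
```

Note that because $G$ reports which step it corrupted, step-level labels come for free by construction, which is how this setup avoids manual step-level annotation.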
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 25 | Accuracy | 94.5 | 201 |
| Math Reasoning | AMC | Accuracy | 70.7 | 70 |
| Math Reasoning | JEEBench | Accuracy | 74.4 | 60 |
| Math Reasoning | OlympiadBench | Accuracy | 90.7 | 54 |
| Math Reasoning | MATH500 | Accuracy | 91.4 | 41 |
| Math Reasoning | OlympiadBench | Accuracy | 90.7 | 36 |
| Mathematical Reasoning | MATH500 | Accuracy | 91.4 | 30 |
| Mathematical Reasoning | AIME 25 | Accuracy | 94.5 | 26 |