
Adversarial Training for Process Reward Models

About

Process Reward Models (PRMs) enhance the reasoning ability of LLMs by providing step-level supervision. However, their widespread adoption is limited by the cost of manual step-level annotation and the poor generalization of static training data to novel errors. We introduce Adversarially Trained PRMs (APRM), in which a Generator (G) learns to produce reasoning errors that deceive a PRM (R), while R concurrently learns to detect them. This interaction yields progressively harder negatives for R, improving its robustness and generalization to novel errors without requiring manual step-level labels. Averaged across diverse mathematical reasoning benchmarks, APRM improves solver accuracy by +3.4 percentage points (pp) over the strongest PRM baseline, with gains of +5.3 pp on out-of-distribution tasks.
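The abstract describes the adversarial loop only at a high level, so here is a minimal, hypothetical sketch of that alternating objective in PyTorch. Everything below is illustrative rather than the authors' method: in the paper, G and R are LLMs operating on reasoning steps, whereas this toy stands in small MLPs over random vectors for step embeddings, and the names (G, R, DIM) are ours.

```python
import torch
import torch.nn as nn

# Toy sketch of the APRM adversarial loop (illustrative only; the paper's
# G and R are LLMs over reasoning steps, not MLPs over random vectors).
DIM = 32

G = nn.Sequential(nn.Linear(DIM, DIM), nn.Tanh(), nn.Linear(DIM, DIM))  # error generator
R = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, 1))    # step-level PRM

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_r = torch.optim.Adam(R.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(200):
    correct = torch.randn(64, DIM)  # stand-in for embeddings of correct steps
    faulty = G(correct)             # G perturbs correct steps into candidate errors

    # R update: score genuine steps high (label 1), generated errors low (label 0).
    r_loss = (bce(R(correct), torch.ones(64, 1))
              + bce(R(faulty.detach()), torch.zeros(64, 1)))
    opt_r.zero_grad()
    r_loss.backward()
    opt_r.step()

    # G update: craft errors that the refreshed R mistakes for correct steps.
    g_loss = bce(R(G(correct)), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```

The detach in the R update is the usual GAN-style separation: R learns to flag G's errors without gradients leaking back into G, and G is then updated against the refreshed R, which is what produces the progressively harder negatives described above.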

Gurusha Juneja, Deepak Nathani, William Yang Wang • 2025

Related benchmarks

Task                    Dataset        Metric    Result  Rank
Mathematical Reasoning  AIME 25        Accuracy  94.5    201
Math Reasoning          AMC            Accuracy  70.7    70
Math Reasoning          JEEBench       Accuracy  74.4    60
Math Reasoning          OlympiadBench  Accuracy  90.7    54
Math Reasoning          MATH500        Accuracy  91.4    41
Math Reasoning          OlympiadBench  Accuracy  90.7    36
Mathematical Reasoning  MATH500        Accuracy  91.4    30
Mathematical Reasoning  AIME 25        Accuracy  94.5    26
