
Discriminative Policy Optimization for Token-Level Reward Models

About

Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward scores derived from token generation probabilities. However, the conflict between generative language modeling and reward modeling may introduce instability and lead to inaccurate credit assignments. To address this challenge, we revisit token-level reward assignment by decoupling reward modeling from language generation and derive a token-level reward model through the optimization of a discriminative policy, termed the Q-function Reward Model (Q-RM). We theoretically demonstrate that Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks. For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH. Code and data are available at https://github.com/homzer/Q-RM.
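To make the idea concrete, here is a minimal sketch of how per-token Q-values from a discriminative token-level reward model could be turned into dense rewards for a REINFORCE-style update. This is an illustrative assumption, not the authors' implementation: the toy Q-values, the one-step-advantage reward (reward at token t taken as Q_t minus the previous Q-value as a baseline), and all function names are hypothetical.

```python
# Hypothetical sketch: converting per-token Q-values from a token-level
# reward model (Q-RM) into dense rewards, then into REINFORCE weights.
# The numbers and the advantage-style reward definition are illustrative
# assumptions, not taken from the paper's code.

from typing import List

def token_rewards_from_q(q_values: List[float]) -> List[float]:
    """Per-token reward as a one-step advantage: r_t = Q_t - V_t,
    where V_t is approximated by the previous token's Q-value
    (baseline assumption for this toy example)."""
    rewards = []
    prev_v = 0.0
    for q in q_values:
        rewards.append(q - prev_v)
        prev_v = q
    return rewards

def reinforce_weights(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted return-to-go at each token; in REINFORCE these weight
    the per-token log-probabilities of the policy."""
    weights = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        weights[t] = running
    return weights

# Toy per-token Q-values standing in for Q-RM outputs on one response.
q = [0.1, 0.4, 0.3, 0.9]
r = token_rewards_from_q(q)   # dense token-level rewards
w = reinforce_weights(r)      # per-token return-to-go
print([round(x, 2) for x in r])  # → [0.1, 0.3, -0.1, 0.6]
print([round(x, 2) for x in w])  # → [0.9, 0.8, 0.5, 0.6]
```

With an undiscounted return (gamma = 1) the weights telescope, so each token's weight reduces to the final Q-value minus that token's baseline, which is what gives token-level credit assignment its fine granularity compared to a single outcome reward.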

Hongzhan Chen, Tao Yang, Shiping Gao, Ruijun Chen, Xiaojun Quan, Hongtao Tian, Ting Yao • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Mathematical Reasoning | AMC | Accuracy | 31.6 | 203
Mathematical Reasoning | Minerva Math | Accuracy | 34.2 | 186
Mathematical Reasoning | MATH 500 | Accuracy | 65.5 | 149
Mathematical Reasoning | Olympiad Bench | Accuracy | 32.2 | 123
Mathematical Reasoning | AIME 2024 | Accuracy | 3.6 | 104
Mathematical Reasoning | AMC'23 (test) | Accuracy | 50 | 60
Mathematical Reasoning | AMC23 (test) | Pass@1 | 5 | 56
Mathematical Reasoning | MATH 500 | Pass@4 | 66.8 | 20
Mathematical Reasoning | Minerva Math | Accuracy@4 | 26.5 | 20
Process-level Error Localization | PROCESSBENCH | GSM8K Accuracy | 61 | 20
