
Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown

About

Reward models (RMs) are essential for aligning large language models (LLMs) with human expectations. However, existing RMs struggle to capture the stochastic and uncertain nature of human preferences and fail to assess the reliability of reward predictions. To address these challenges, we introduce the Uncertainty-aware Reward Model (URM) and its ensemble variant, URME. URM employs a probabilistic value head to capture aleatoric uncertainty by modeling the distribution of disentangled human preference attributes. URME further quantifies epistemic uncertainty by examining discrepancies among individual URMs within the ensemble, enabling identification of unreliable evaluations. Our empirical evaluations demonstrate that URM achieves strong performance on RewardBench, outperforming competitive large-scale models. Additionally, extensive experiments, including best-of-n sampling (BoN), iterative direct preference optimization (iterative DPO), and proximal policy optimization (PPO), demonstrate that URM and URME significantly enhance LLMs' generation quality. Notably, reward predictions with lower uncertainty are far more reliable, demonstrate significantly higher quality, and result in substantially improved alignment.
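The two kinds of uncertainty described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the probabilistic value head outputs a mean and log-variance per attribute (a Gaussian, which the abstract does not specify), and it assumes ensemble "discrepancy" is measured as the standard deviation of the members' predicted rewards. All function names and the `max_disagreement` threshold are hypothetical.

```python
import math
from statistics import mean, pstdev


def gaussian_nll(mu, log_var, target):
    """Negative log-likelihood of `target` under N(mu, exp(log_var)).

    Assumed training loss for one attribute of a probabilistic value head.
    """
    return 0.5 * (math.log(2 * math.pi) + log_var
                  + (target - mu) ** 2 / math.exp(log_var))


def aleatoric_std(log_var):
    """Predicted standard deviation: the head's own estimate of label noise."""
    return math.exp(0.5 * log_var)


def ensemble_disagreement(member_rewards):
    """Population std of the members' rewards (assumed epistemic proxy)."""
    return pstdev(member_rewards)


def best_of_n(candidates, max_disagreement):
    """Pick the candidate with the highest mean ensemble reward.

    `candidates` is a list of (text, [reward from each ensemble member]).
    Candidates whose disagreement exceeds `max_disagreement` are treated as
    unreliable and skipped; if none pass, fall back to the full list.
    """
    trusted = [c for c in candidates
               if ensemble_disagreement(c[1]) <= max_disagreement]
    pool = trusted or candidates
    return max(pool, key=lambda c: mean(c[1]))[0]


# Hypothetical rewards from a three-member ensemble for two responses.
candidates = [
    ("response A", [0.9, 0.1, 2.0]),    # high mean, members disagree
    ("response B", [0.6, 0.7, 0.65]),   # lower mean, members agree
]
choice = best_of_n(candidates, max_disagreement=0.5)
```

With a strict threshold the sketch prefers the response the ensemble agrees on, even though another response has a higher average reward; loosening the threshold recovers plain best-of-n selection.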

Xingzhou Lou, Dong Yan, Wei Shen, Yuzi Yan, Jian Xie, Junge Zhang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reward Modeling | RewardBench | Accuracy | 92.9 | 166 |
| Reward Modeling | RM-Bench | Accuracy | 72 | 125 |
| Reward Modeling | RMB | Accuracy | 65.7 | 120 |
| Reward Modeling | JudgeBench | Accuracy | 64.1 | 105 |
| Reward Modeling | RewardBench v2 | Accuracy | 73.9 | 72 |
| Reward Modeling | PPE-Preference | Accuracy | 60.2 | 60 |
| Reward Modeling | Aggregate of 7 benchmarks (HelpSteer3, Reward Bench V2, SCAN-HPD, HREF, LitBench, WQ_Arena, WPB) | Overall Accuracy | 67.68 | 45 |
| Reward Modeling | RewardBench v2 (test) | Average Score | 73.9 | 42 |
| Reward Modeling | PPE Correlation | Correlation | 60.4 | 40 |
| Reward Modeling | HelpSteer 3 | Accuracy | 80.12 | 39 |
Showing 10 of 20 rows
