Quantile Regression for Distributional Reward Models in RLHF

Abstract

Reinforcement learning from human feedback (RLHF) has become a key method for aligning large language models (LLMs) with human preferences through the use of reward models. However, traditional reward models typically generate point estimates, which oversimplify the diversity and complexity of human values and preferences. In this paper, we introduce Quantile Reward Models (QRMs), a novel approach to reward modeling that learns a distribution over rewards instead of a single scalar value. Our method uses quantile regression to estimate a full, potentially multimodal distribution over preferences, providing a more powerful and nuanced representation of preferences. This distributional approach can better capture the diversity of human values, addresses label noise, and accommodates conflicting preferences by modeling them as distinct modes in the distribution. Our experimental results show that QRM outperforms comparable traditional point-estimate models on RewardBench. Furthermore, we demonstrate that the additional information provided by the distributional estimates can be utilized in downstream applications, such as risk-aware reinforcement learning, resulting in LLM policies that generate fewer extremely negative responses. Our code and model are released at https://github.com/Nicolinho/QRM.
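The core idea of quantile regression is the pinball (quantile) loss: an asymmetric penalty whose expected value is minimized by the τ-quantile of the target distribution, so predicting several quantiles recovers the full reward distribution, including distinct modes arising from conflicting preferences. The sketch below is illustrative only, not the paper's implementation: it uses a synthetic bimodal "reward" sample standing in for disagreeing annotators, and a brute-force grid search in place of a trained quantile head.

```python
import numpy as np

def pinball_loss(pred, target, tau):
    # Quantile (pinball) loss: overshooting is weighted by (1 - tau),
    # undershooting by tau, so the expected loss is minimized at the
    # tau-quantile of the target distribution.
    diff = target - pred
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

# Synthetic bimodal reward sample: two modes mimic two conflicting
# preference groups (purely illustrative data).
rng = np.random.default_rng(0)
rewards = np.concatenate([rng.normal(-2.0, 0.3, 5000),
                          rng.normal(2.0, 0.3, 5000)])

# Recover the 10%, 50%, and 90% quantiles by minimizing the pinball
# loss over a grid of candidate scalar predictions.
taus = [0.1, 0.5, 0.9]
grid = np.linspace(-4.0, 4.0, 801)
est = [grid[np.argmin([pinball_loss(g, rewards, t) for g in grid])]
       for t in taus]
```

The recovered low quantile sits inside the negative mode while the high quantile sits inside the positive one, which is the information a point-estimate reward model collapses away; a risk-aware policy could, for instance, optimize a low quantile instead of the mean to suppress extremely negative responses.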

Nicolai Dorka • 2024

Related benchmarks

Task                          Dataset                             Metric         Result   Rank
Reward Modeling               RewardBench                         Accuracy       94.4     166
Reward Modeling               RM-Bench                            Accuracy       72.8     125
Reward Modeling               RMB                                 Accuracy       64.7     120
Reward Modeling               JudgeBench                          Accuracy       63.8     105
Reward Modeling               RewardBench v2                      Accuracy       76.7     72
Reward Modeling               PPE-Preference                      Accuracy       60.6     60
Reward Modeling               RewardBench v2 (test)               Average Score  76.7     42
Reward Modeling               PPE Correlation                     Correlation    60.5     40
Role-playing Reward Modeling  RoleRM-Bench                        Average Score  47.42    22
Reward Modeling               PMDC Maximum Discrepancy samples    Rank           2        10
