
Towards Reward Fairness in RLHF: From a Resource Allocation Perspective

About

Rewards serve as proxies for human preferences and play a crucial role in Reinforcement Learning from Human Feedback (RLHF). However, rewards are often imperfect and exhibit various biases that can adversely affect the alignment of large language models (LLMs). In this paper, we collectively define the various biases present in rewards as the problem of reward unfairness. We propose a bias-agnostic method that addresses reward fairness from a resource allocation perspective: rather than designing a separate fix for each type of bias, it mitigates them all at once. Specifically, we model preference learning as a resource allocation problem, treating rewards as resources to be allocated while trading off utility against fairness in their distribution. We propose two methods, Fairness Regularization and Fairness Coefficient, to achieve fairness in rewards. We apply our methods in both verification and reinforcement learning scenarios to obtain a fair reward model and a policy model, respectively. Experiments conducted in these scenarios demonstrate that our approach aligns LLMs with human preferences more fairly.
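The abstract describes adding a fairness term to preference learning, trading utility against an even allocation of reward "resources". The sketch below illustrates one plausible instantiation of the Fairness Regularization idea; the paper's exact formulation is not reproduced here, so the choice of the Gini coefficient as the fairness measure, the Bradley-Terry utility loss, the weight `lam`, and all function names are illustrative assumptions.

```python
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative vector: 0 means a perfectly even split."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    total = x.sum()
    if total == 0:
        return 0.0
    idx = np.arange(1, n + 1)
    return float(((2 * idx - n - 1) @ x) / (n * total))

def fairness_regularized_loss(r_chosen, r_rejected, lam=0.1):
    """Illustrative fairness-regularized reward objective (not the paper's code).

    r_chosen / r_rejected: reward-model scores for the preferred / dispreferred
    responses in a batch. `lam` trades utility against fairness.
    """
    rc = np.asarray(r_chosen, dtype=float)
    rr = np.asarray(r_rejected, dtype=float)
    # Utility term: standard pairwise Bradley-Terry preference loss.
    utility_loss = float(np.mean(np.log1p(np.exp(-(rc - rr)))))
    # Fairness term: shift rewards to be non-negative, then penalize an
    # uneven allocation of the total reward "resource" across the batch.
    all_r = np.concatenate([rc, rr])
    fairness_penalty = gini(all_r - all_r.min())
    return utility_loss + lam * fairness_penalty
```

The alternative Fairness Coefficient method mentioned in the abstract would presumably rescale individual rewards rather than add a penalty term, but its details are not given on this page.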

Sheng Ouyang, Yulan Hu, Ge Chen, Qingyang Li, Fuzheng Zhang, Yong Liu • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Instruction Following | AlpacaEval 2.0 | LC Win Rate | 21.1 | 281
Instruction Following | MT-Bench | MT-Bench Score | 7.24 | 189
Reward Modeling | RewardBench | Avg Score | 78.38 | 118
Bias Evaluation | BBQ | Accuracy | 88.6 | 99
Out-of-Domain (OOD) Bias Evaluation | WinoBias | Accuracy | 0.506 | 14
Stereotypical Bias Mitigation | UNQOVER | Accuracy | 94.9 | 14
Structural Bias Evaluation | MNLI | Accuracy | 80.3 | 14
Out-of-Domain (OOD) Bias Evaluation | StereoSet | Accuracy | 58.1 | 14
General Utility Evaluation | MT-Bench | Agreement Rate | 45.4 | 14
Structural Bias Evaluation | HANS | Accuracy | 54.2 | 14
(10 of 12 benchmark entries shown.)

Other info

Code
