
A Long Way to Go: Investigating Length Correlations in RLHF

About

Great success has been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models, with open preference datasets enabling wider experimentation, particularly for "helpfulness" in tasks like dialogue and web question answering. Alongside these improvements, however, RLHF also often drives models to produce longer outputs. This paper demonstrates, on three diverse settings, that optimizing for response length is, much more than previously thought, a significant factor behind RLHF's reported improvements. Studying the strategies RL optimization uses to maximize reward, we find that improvements in reward are largely driven by increasing response length rather than other features. Indeed, we find that even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models. Testing a comprehensive set of length-countering interventions, we identify reward models as the dominant source of these biases; studying their training dynamics, we find them to be non-robust and easily influenced by length biases in the preference data.
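As a rough illustration of the core finding (that a purely length-based reward can stand in for a learned reward model, and that learned reward scores correlate strongly with response length), here is a minimal Python sketch. The function names, the length cap, and the toy data are illustrative assumptions, not the paper's actual setup or metrics.

# Sketch only: a purely length-based proxy reward and a check of how
# strongly a learned reward model's scores correlate with length.
# All names, caps, and numbers here are illustrative assumptions.
import numpy as np

def length_reward(response: str, target_len: int = 256) -> float:
    # Purely length-based proxy reward: longer responses score higher,
    # up to an illustrative cap of `target_len` whitespace tokens.
    n_tokens = len(response.split())
    return min(n_tokens, target_len) / target_len

def length_reward_correlation(responses, learned_scores) -> float:
    # Pearson correlation between response length and a learned reward
    # model's scores; a high value suggests length-driven reward gains.
    lengths = np.array([len(r.split()) for r in responses], dtype=float)
    scores = np.array(learned_scores, dtype=float)
    return float(np.corrcoef(lengths, scores)[0, 1])

if __name__ == "__main__":
    responses = [
        "Short answer.",
        "A somewhat longer answer with more detail and hedging.",
        "A very long answer " + "that keeps elaborating " * 20,
    ]
    fake_scores = [0.1, 0.4, 0.9]  # stand-in for reward model outputs
    print(length_reward_correlation(responses, fake_scores))
    print([round(length_reward(r), 3) for r in responses])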

Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett • 2023

Related benchmarks

Task | Dataset | Result | Rank
Reward Hacking Mitigation | Synthetic Goodhart 1.0 (Evaluation) | R_g: 3.85 | 10
Reward Hacking Mitigation | Excessive HH Harmless 1.0 (Evaluation) | Reference Error Rate: 16.8 | 10
Reward Hacking Mitigation | Length Bias OA Length 1.0 (Evaluation) | Dominance: 32 | 9
