
A Long Way to Go: Investigating Length Correlations in RLHF

About

Great success has been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models, with open preference datasets enabling wider experimentation, particularly for "helpfulness" in tasks like dialogue and web question answering. Alongside these improvements, however, RLHF also often drives models to produce longer outputs. This paper demonstrates, on three diverse settings, that optimizing for response length is, much more than previously thought, a significant factor behind RLHF's reported improvements. Studying the strategies RL optimization uses to maximize reward, we find that improvements in reward are largely driven by increasing response length rather than other features. Indeed, we find that even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models. Testing a comprehensive set of length-countering interventions, we identify reward models as the dominant source of these biases; studying their training dynamics, we find them to be non-robust and easily influenced by length biases in the preference data.
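As a rough illustration of the core finding (that a purely length-based reward can stand in for a learned reward model, and that learned reward scores correlate strongly with response length), here is a minimal Python sketch. The function names, the length cap, and the toy data are illustrative assumptions, not the paper's actual setup or metrics.

# Sketch only: a purely length-based proxy reward and a check of how
# strongly a learned reward model's scores correlate with length.
# All names, caps, and numbers here are illustrative assumptions.
import numpy as np

def length_reward(response: str, target_len: int = 256) -> float:
    # Purely length-based proxy reward: longer responses score higher,
    # up to an illustrative cap of `target_len` whitespace tokens.
    n_tokens = len(response.split())
    return min(n_tokens, target_len) / target_len

def length_reward_correlation(responses, learned_scores) -> float:
    # Pearson correlation between response length and a learned reward
    # model's scores; a high value suggests length-driven reward gains.
    lengths = np.array([len(r.split()) for r in responses], dtype=float)
    scores = np.array(learned_scores, dtype=float)
    return float(np.corrcoef(lengths, scores)[0, 1])

if __name__ == "__main__":
    responses = [
        "Short answer.",
        "A somewhat longer answer with more detail and hedging.",
        "A very long answer " + "that keeps elaborating " * 20,
    ]
    fake_scores = [0.1, 0.4, 0.9]  # stand-in for reward model outputs
    print(length_reward_correlation(responses, fake_scores))
    print([round(length_reward(r), 3) for r in responses])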

Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett • 2023

Related benchmarks

Task | Dataset | Result | Rank
Reward Hacking Mitigation | Synthetic Goodhart 1.0 (Evaluation) | R_g: 3.85 | 10
Reward Hacking Mitigation | Excessive HH Harmless 1.0 (Evaluation) | Reference Error Rate: 16.8 | 10
Reward Hacking Mitigation | Length Bias OA Length 1.0 (Evaluation) | Dominance: 32 | 9
