Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking

About

Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using proxy reward functions that only approximate the true goal. However, optimizing proxy rewards frequently leads to reward hacking: the optimized reward function ceases to be a good proxy and the resulting policy performs poorly with respect to the unspecified true reward. Principled solutions to reward hacking have been impeded by the lack of a good definition for the problem. To address this gap, we introduce a definition of reward hacking based on the correlation between proxy and true rewards for states and actions seen by a "reference policy" that breaks down under optimization. We show that this definition captures reward hacking behavior across several realistic settings, including in reinforcement learning from human feedback (RLHF). Using our formulation, we show theoretically that regularization to the reference policy can effectively prevent reward hacking. While the current practice in RLHF applies a KL penalty between action distributions for this purpose, our theory suggests regularizing the $\chi^2$ divergence between the policies' occupancy measures can be more effective. We intuitively show the benefits of this type of regularization and demonstrate that it better mitigates reward hacking in practice across four realistic settings, including RLHF. Our code is available at https://github.com/cassidylaidlaw/orpo.
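For intuition, the two regularization schemes mentioned in the abstract can be contrasted as follows (the notation here is illustrative, not taken from the paper): write $J_{\text{proxy}}(\pi)$ for the expected proxy reward, $\pi_{\mathrm{ref}}$ for the reference policy, $d^{\pi}$ for the state(-action) occupancy measure of $\pi$, and $\lambda$ for a regularization weight.

$$\max_{\pi}\; J_{\text{proxy}}(\pi) \;-\; \lambda\, \mathbb{E}_{s \sim d^{\pi}}\!\left[ D_{\mathrm{KL}}\!\big(\pi(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s)\big) \right] \qquad \text{(standard RLHF: KL penalty on action distributions)}$$

$$\max_{\pi}\; J_{\text{proxy}}(\pi) \;-\; \lambda\, \chi^{2}\!\big(d^{\pi} \,\|\, d^{\pi_{\mathrm{ref}}}\big) \qquad (\chi^2 \text{ penalty on occupancy measures, as suggested by the abstract})$$

The first penalizes per-state deviations of the action distribution from the reference policy; the second penalizes how far the policy's overall visitation distribution drifts from that of the reference policy, which is the quantity the paper argues matters for preventing the proxy-true reward correlation from breaking down.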

Cassidy Laidlaw, Shivam Singhal, Anca Dragan • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Policy Optimization | Traffic | True Outcome | 16.91 | 8 |
| Policy Optimization | Pandemic | True Performance | -1.04 | 8 |
| Policy Optimization | Glucose | True Outcome | 6 | 6 |
| Policy Optimization | RLHF | True Score | 8.3 | 5 |
| Reinforcement Learning | TOMATO | True Score | 6.28 | 3 |
