Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking

About

Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using proxy reward functions that only approximate the true goal. However, optimizing proxy rewards frequently leads to reward hacking: the optimized reward function ceases to be a good proxy and the resulting policy performs poorly with respect to the unspecified true reward. Principled solutions to reward hacking have been impeded by the lack of a good definition for the problem. To address this gap, we introduce a definition of reward hacking based on the correlation between proxy and true rewards for states and actions seen by a "reference policy" that breaks down under optimization. We show that this definition captures reward hacking behavior across several realistic settings, including in reinforcement learning from human feedback (RLHF). Using our formulation, we show theoretically that regularization to the reference policy can effectively prevent reward hacking. While the current practice in RLHF applies a KL penalty between action distributions for this purpose, our theory suggests regularizing the $\chi^2$ divergence between the policies' occupancy measures can be more effective. We intuitively show the benefits of this type of regularization and demonstrate that it better mitigates reward hacking in practice across four realistic settings, including RLHF. Our code is available at https://github.com/cassidylaidlaw/orpo.
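For intuition, the two regularization schemes mentioned in the abstract can be contrasted as follows (the notation here is illustrative, not taken from the paper): write $J_{\text{proxy}}(\pi)$ for the expected proxy reward, $\pi_{\mathrm{ref}}$ for the reference policy, $d^{\pi}$ for the state(-action) occupancy measure of $\pi$, and $\lambda$ for a regularization weight.

$$\max_{\pi}\; J_{\text{proxy}}(\pi) \;-\; \lambda\, \mathbb{E}_{s \sim d^{\pi}}\!\left[ D_{\mathrm{KL}}\!\big(\pi(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s)\big) \right] \qquad \text{(standard RLHF: KL penalty on action distributions)}$$

$$\max_{\pi}\; J_{\text{proxy}}(\pi) \;-\; \lambda\, \chi^{2}\!\big(d^{\pi} \,\|\, d^{\pi_{\mathrm{ref}}}\big) \qquad (\chi^2 \text{ penalty on occupancy measures, as suggested by the abstract})$$

The first penalizes per-state deviations of the action distribution from the reference policy; the second penalizes how far the policy's overall visitation distribution drifts from that of the reference policy, which is the quantity the paper argues matters for preventing the proxy-true reward correlation from breaking down.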

Cassidy Laidlaw, Shivam Singhal, Anca Dragan • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Policy Optimization | Traffic | True Outcome | 16.91 | 8 |
| Policy Optimization | Pandemic | True Performance | -1.04 | 8 |
| Policy Optimization | Glucose | True Outcome | 6 | 6 |
| Policy Optimization | RLHF | True Score | 8.3 | 5 |
| Reinforcement Learning | TOMATO | True Score | 6.28 | 3 |
