A density estimation perspective on learning from pairwise human preferences

About

Learning from human feedback (LHF) -- and in particular learning from pairwise preferences -- has recently become a crucial ingredient in training large language models (LLMs), and has been the subject of much research. Most recent works frame it as a reinforcement learning problem, where a reward function is learned from pairwise preference data and the LLM is treated as a policy which is adapted to maximize the rewards, often under additional regularization constraints. We propose an alternative interpretation which centers on the generative process for pairwise preferences and treats LHF as a density estimation problem. We provide theoretical and empirical results showing that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution. Finally, we discuss and present findings on "annotator misspecification" -- failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly-adapted models -- suggesting that approaches that learn from pairwise human preferences could have trouble learning from a population of annotators with diverse viewpoints.
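The density estimation reading can be illustrated on a toy discrete domain. The sketch below is not the authors' code. It assumes one specific generative process, a Luce choice rule in which the annotator prefers item a over item b with probability p(a) / (p(a) + p(b)), and fits a Bradley-Terry style reward with a logistic likelihood. The names `fit_reward`, `implied_density`, and `p_true` are hypothetical, and the exact population gradient is used in place of sampled preference pairs to keep the example deterministic.

```python
import numpy as np


def fit_reward(p_true, steps=2000, lr=0.5):
    """Fit one reward per item by gradient ascent on the expected
    pairwise log-likelihood, assuming a Luce-choice annotator."""
    p_true = np.asarray(p_true, dtype=float)
    k = len(p_true)
    # Assumed annotator behavior: P(a preferred over b) = p(a) / (p(a) + p(b)).
    pref = p_true[:, None] / (p_true[:, None] + p_true[None, :])
    r = np.zeros(k)
    for _ in range(steps):
        # Model preference probability: sigmoid(r[a] - r[b]).
        model = 1.0 / (1.0 + np.exp(-(r[:, None] - r[None, :])))
        # Gradient of the expected log-likelihood summed over all ordered pairs.
        grad = 2.0 * (pref - model).sum(axis=1)
        r += lr * grad / k
    return r


def implied_density(r):
    """Normalize exp(reward) into a probability distribution."""
    z = np.exp(r - r.max())
    return z / z.sum()


p_true = np.array([0.5, 0.3, 0.15, 0.05])
reward = fit_reward(p_true)
print("annotator distribution:", p_true)
print("softmax of fitted reward:", implied_density(reward))
```

Under this assumed choice rule, the normalized exponential of the fitted reward matches the annotator's implicit distribution, which is the sense in which reward learning acts as density estimation here. An annotator who follows a different rule would break this correspondence, which is the "annotator misspecification" failure mode the abstract describes.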

Vincent Dumoulin, Daniel D. Johnson, Pablo Samuel Castro, Hugo Larochelle, Yann Dauphin • 2023

Related benchmarks

Task	Dataset	Result
Imitation Learning	Dataset 3	MAE 3.4	13
Imitation Learning	Dataset 5	MAE 5.5	13
Imitation Learning	Dataset 2	MAE 2.72	13
Inverse Reinforcement Learning	Dataset 2	MSE 15.26	13
Imitation Learning	Dataset-1	MAE 4.3	13
Imitation Learning	Dataset 4	MAE 4.6	13
Inverse Reinforcement Learning	Dataset-1	MSE 43.1	13
Inverse Reinforcement Learning	Dataset 3	MSE 41.38	13
Inverse Reinforcement Learning	Dataset 4	MSE 46.06	13
Inverse Reinforcement Learning	Dataset 5	MSE 115.7	13

