A General Theoretical Paradigm to Understand Learning from Human Preferences
About
The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards; the second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an approach that bypasses the second approximation and learns a policy directly from collected data, without the reward modelling stage. However, this method still heavily relies on the first approximation. In this paper we try to gain a deeper theoretical understanding of these practical algorithms. In particular, we derive a new general objective called $\Psi$PO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behaviour of RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential pitfalls. We then consider another special case of $\Psi$PO, obtained by setting $\Psi$ to the identity, for which we can derive an efficient optimisation procedure, prove performance guarantees, and demonstrate its empirical superiority to DPO on some illustrative examples.
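The identity-$\Psi$ special case mentioned above optimises pairwise preferences directly, regressing the policy's log-likelihood-ratio margin on a preference pair towards a finite target set by the regularisation strength, rather than pushing it to infinity as DPO can. A minimal sketch of a per-pair loss in this spirit (variable names and the toy values are illustrative, not from the paper; `tau` plays the role of the KL-regularisation coefficient):

```python
import math

def ipo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """Squared-error preference loss on one pair (sketch).

    logp_w, logp_l:         policy log-probabilities of the preferred (w)
                            and dispreferred (l) completions
    ref_logp_w, ref_logp_l: the same quantities under a frozen reference policy
    tau:                    strength of regularisation towards the reference
    """
    # Log-likelihood-ratio margin between preferred and dispreferred completions
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # Regress the margin towards the finite target 1 / (2 * tau)
    return (h - 1.0 / (2.0 * tau)) ** 2

# Toy check: when the policy equals the reference, the margin is 0,
# so with tau = 0.5 the loss is (0 - 1)^2 = 1.
loss = ipo_pair_loss(-1.0, -2.0, -1.0, -2.0, tau=0.5)
```

Because the target is finite, the loss is minimised at a bounded margin, which is the mechanism behind the pitfall analysis of DPO sketched in the abstract (DPO's logistic loss keeps rewarding ever-larger margins on deterministic preferences).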
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy: 84.07 | 1891 |
| Object Hallucination Evaluation | POPE | Accuracy: 88.71 | 1455 |
| Code Generation | HumanEval | -- | 1036 |
| Language Understanding | MMLU | Accuracy: 58.03 | 825 |
| Reasoning | BBH | -- | 672 |
| Instruction Following | IFEval | IFEval Accuracy: 77 | 625 |
| Physical Commonsense Reasoning | PIQA | Accuracy: 78.94 | 572 |
| Instruction Following | AlpacaEval 2.0 | Win Rate: 58.4 | 507 |
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score: 8.17 | 447 |
| Commonsense Reasoning | WinoGrande | Accuracy: 73.09 | 372 |