A General Theoretical Paradigm to Understand Learning from Human Preferences
About
The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations. The first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an approach that bypasses the second approximation and learns a policy directly from collected data, without the reward modelling stage. However, this method still relies heavily on the first approximation. In this paper, we try to gain a deeper theoretical understanding of these practical algorithms. In particular, we derive a new general objective called $\Psi$PO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential pitfalls. We then consider another special case of $\Psi$PO by setting $\Psi$ simply to the identity, for which we can derive an efficient optimisation procedure, prove performance guarantees, and demonstrate its empirical superiority to DPO on some illustrative examples.
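To make the contrast between DPO and the identity case concrete, the following is a minimal sketch of the two per-pair losses as they are commonly stated. The abstract does not give these formulas, so the loss forms are an assumption based on the usual presentation: both act on the margin $h = \log\frac{\pi(y_w)}{\pi_{\text{ref}}(y_w)} - \log\frac{\pi(y_l)}{\pi_{\text{ref}}(y_l)}$ between a preferred completion $y_w$ and a dispreferred completion $y_l$. The function names, `beta`, and `tau` are illustrative and not taken from the paper's code.

```python
import math


def log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Margin h between the policy/reference log-ratios of the two completions.

    Inputs are sequence log-probabilities (hypothetical names):
    *_w for the preferred completion, *_l for the dispreferred one.
    """
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def dpo_loss(h, beta):
    """Assumed DPO per-pair loss: -log(sigmoid(beta * h)).

    Written as softplus(-beta * h) in a numerically stable form.
    """
    z = beta * h
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def ipo_loss(h, tau):
    """Assumed identity-Psi per-pair loss: squared distance of h from 1 / (2 * tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2


if __name__ == "__main__":
    # Sweep the margin: the DPO loss keeps shrinking as h grows,
    # while the squared loss is minimised at the finite target 1 / (2 * tau).
    for h in (0.0, 2.5, 5.0, 10.0):
        print(f"h={h:5.1f}  dpo={dpo_loss(h, beta=0.1):.4f}  ipo={ipo_loss(h, tau=0.1):.4f}")
```

Under these assumed forms, the logistic loss has no finite minimiser in $h$, so it always rewards pushing the margin further, whereas the squared loss pulls the margin towards the finite target $1/(2\tau)$. This is one way to read the pitfalls the abstract alludes to, though the paper itself should be consulted for the precise statement.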
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy 84.07 | 1460 |
| Language Understanding | MMLU | Accuracy 58.03 | 756 |
| Reasoning | BBH | -- | 507 |
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score 8.17 | 331 |
| Physical Commonsense Reasoning | PIQA | Accuracy 78.94 | 329 |
| Instruction Following | IFEval | -- | 292 |
| Instruction Following | AlpacaEval 2.0 | LC Win Rate 43.7 | 281 |
| Commonsense Reasoning | WinoGrande | Accuracy 73.09 | 231 |
| Question Answering | ARC | Accuracy 63.91 | 154 |
| Mathematical Reasoning | GSM8K | EM 64.06 | 115 |