Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts

About

Direct Preference Optimization (DPO) has become a popular approach for aligning language models using pairwise preferences. However, in practical post-training pipelines, on-policy generation typically yields multiple candidate responses per prompt, which are scored by a reward model to guide learning. In this setting, we propose $\textbf{Multi-Preference Optimization (MPO)}$, a generalization of DPO that optimizes over entire sets of responses by extending the Bradley-Terry model to groupwise comparisons between chosen and rejected sets. To further enhance learning, MPO employs deviation-based weighting, which emphasizes outlier responses that deviate most from the mean reward, effectively inducing a self-paced curriculum. We theoretically prove that MPO reduces alignment bias at a rate of $\mathcal{O}\left(\frac{1}{\sqrt{n}}\right)$ with respect to the number of responses per query. Empirically, MPO achieves state-of-the-art performance on the UltraFeedback benchmark and yields up to $\sim 17.5\%$ improvement over the best prior baseline in length-controlled win rate on AlpacaEval2, establishing a new baseline for preference-based alignment.
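
To make the set-level contrast concrete, here is a minimal sketch of how a groupwise Bradley-Terry objective with deviation-based weighting could look in PyTorch. The function name `mpo_loss_sketch`, the DPO-style implicit rewards $\beta(\log \pi_\theta - \log \pi_{\text{ref}})$, and the within-prompt weight normalization are illustrative assumptions for exposition, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def mpo_loss_sketch(policy_logps_chosen, ref_logps_chosen,
                    policy_logps_rejected, ref_logps_rejected,
                    rewards_chosen, rewards_rejected, beta=0.1):
    """Illustrative set-level Bradley-Terry contrast with deviation weights.

    Each tensor is 1-D over the responses in one set for a single prompt.
    This is a sketch of the idea, not the paper's exact formulation.
    """
    # DPO-style implicit rewards: beta * (log pi_theta - log pi_ref).
    s_c = beta * (policy_logps_chosen - ref_logps_chosen)
    s_r = beta * (policy_logps_rejected - ref_logps_rejected)

    # Deviation-based weighting: responses whose reward-model scores lie
    # far from the per-prompt mean get more emphasis (assumed normalization).
    all_rewards = torch.cat([rewards_chosen, rewards_rejected])
    mu = all_rewards.mean()
    w_c = (rewards_chosen - mu).abs()
    w_r = (rewards_rejected - mu).abs()
    z = w_c.sum() + w_r.sum() + 1e-8  # avoid division by zero
    w_c, w_r = w_c / z, w_r / z

    # Groupwise contrast: weighted chosen-set score vs. weighted
    # rejected-set score, passed through the Bradley-Terry sigmoid.
    margin = (w_c * s_c).sum() - (w_r * s_r).sum()
    return -F.logsigmoid(margin)

# Dummy usage for one prompt with 3 chosen and 3 rejected responses.
loss = mpo_loss_sketch(
    torch.tensor([-10.0, -12.0, -11.0]), torch.tensor([-11.0, -12.5, -11.2]),
    torch.tensor([-14.0, -13.0, -15.0]), torch.tensor([-13.0, -12.8, -14.0]),
    torch.tensor([0.90, 0.80, 0.85]), torch.tensor([0.20, 0.35, 0.10]),
)
```

Under this weighting, responses far from the mean reward contribute most to the gradient, matching the self-paced-curriculum effect described in the abstract.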

Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Nagarajan Natarajan, Chetan Bansal, Saravan Rajmohan • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Captioning | VDC | Score: 2.35 | 10 |
| Video Captioning | MSR-VTT (RCC) | Relevance: 7.31 | 10 |
| Video Captioning | PE-Video (RCC) | Relevance: 7.57 | 10 |
| Video Captioning | ARGUS | Cost-H: 0.572 | 10 |
