Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts

About

Direct Preference Optimization (DPO) has become a popular approach for aligning language models using pairwise preferences. However, in practical post-training pipelines, on-policy generation typically yields multiple candidate responses per prompt, which are scored by a reward model to guide learning. In this setting, we propose $\textbf{Multi-Preference Optimization (MPO)}$, a generalization of DPO that optimizes over entire sets of responses by extending the Bradley-Terry model to groupwise comparisons between chosen and rejected sets. To further enhance learning, MPO employs deviation-based weighting, which emphasizes outlier responses that deviate most from the mean reward, effectively inducing a self-paced curriculum. We theoretically prove that MPO reduces alignment bias at a rate of $\mathcal{O}\left(\frac{1}{\sqrt{n}}\right)$ with respect to the number of responses per query. Empirically, MPO achieves state-of-the-art performance on the UltraFeedback benchmark and yields up to $\sim 17.5\%$ improvement over the best prior baseline in length-controlled win rate on AlpacaEval2, establishing a new baseline for preference-based alignment.
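
To make the set-level contrast concrete, here is a minimal sketch of how a groupwise Bradley-Terry objective with deviation-based weighting could look in PyTorch. The function name `mpo_loss_sketch`, the DPO-style implicit rewards $\beta(\log \pi_\theta - \log \pi_{\text{ref}})$, and the within-prompt weight normalization are illustrative assumptions for exposition, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def mpo_loss_sketch(policy_logps_chosen, ref_logps_chosen,
                    policy_logps_rejected, ref_logps_rejected,
                    rewards_chosen, rewards_rejected, beta=0.1):
    """Illustrative set-level Bradley-Terry contrast with deviation weights.

    Each tensor is 1-D over the responses in one set for a single prompt.
    This is a sketch of the idea, not the paper's exact formulation.
    """
    # DPO-style implicit rewards: beta * (log pi_theta - log pi_ref).
    s_c = beta * (policy_logps_chosen - ref_logps_chosen)
    s_r = beta * (policy_logps_rejected - ref_logps_rejected)

    # Deviation-based weighting: responses whose reward-model scores lie
    # far from the per-prompt mean get more emphasis (assumed normalization).
    all_rewards = torch.cat([rewards_chosen, rewards_rejected])
    mu = all_rewards.mean()
    w_c = (rewards_chosen - mu).abs()
    w_r = (rewards_rejected - mu).abs()
    z = w_c.sum() + w_r.sum() + 1e-8  # avoid division by zero
    w_c, w_r = w_c / z, w_r / z

    # Groupwise contrast: weighted chosen-set score vs. weighted
    # rejected-set score, passed through the Bradley-Terry sigmoid.
    margin = (w_c * s_c).sum() - (w_r * s_r).sum()
    return -F.logsigmoid(margin)

# Dummy usage for one prompt with 3 chosen and 3 rejected responses.
loss = mpo_loss_sketch(
    torch.tensor([-10.0, -12.0, -11.0]), torch.tensor([-11.0, -12.5, -11.2]),
    torch.tensor([-14.0, -13.0, -15.0]), torch.tensor([-13.0, -12.8, -14.0]),
    torch.tensor([0.90, 0.80, 0.85]), torch.tensor([0.20, 0.35, 0.10]),
)
```

Under this weighting, responses far from the mean reward contribute most to the gradient, matching the self-paced-curriculum effect described in the abstract.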

Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Nagarajan Natarajan, Chetan Bansal, Saravan Rajmohan • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Captioning | VDC | Score: 2.35 | 10 |
| Video Captioning | MSR-VTT (RCC) | Relevance: 7.31 | 10 |
| Video Captioning | PE-Video (RCC) | Relevance: 7.57 | 10 |
| Video Captioning | ARGUS | Cost-H: 0.572 | 10 |
