
Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences

About

DPO and related algorithms align language models by directly optimizing the RLHF objective: find a policy that maximizes the Bradley-Terry reward while staying close to a reference policy through a KL-divergence penalty. Previous work showed that this approach can be generalized further: the original problem remains tractable even if the KL divergence is replaced by any $f$-divergence with a convex generating function $f$. Our first contribution is to show that convexity of $f$ is not essential. Instead, we identify a more general condition, referred to as DPO-inducing, that precisely characterizes when the RLHF problem remains tractable. Our second contribution is to establish a further condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of both the winner and the loser responses approach zero. We refer to any $f$ satisfying this condition as displacement-resistant. Finally, we focus on a specific $f$ that is both DPO-inducing and displacement-resistant, leading to our novel SquaredPO loss. Compared to DPO, this new loss offers stronger theoretical guarantees while performing competitively in practice.
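For reference, the KL-regularized RLHF objective that DPO solves, and its $f$-divergence generalization discussed above, can be written as follows (this is the standard formulation, not a reproduction of the paper's notation):

$$
\max_{\pi}\;\mathbb{E}_{x,\,y\sim\pi(\cdot\mid x)}\big[r(x,y)\big]\;-\;\beta\,D_f\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big),
\qquad
D_f(p\,\|\,q)=\mathbb{E}_{y\sim q}\Big[f\Big(\tfrac{p(y)}{q(y)}\Big)\Big],
$$

where the original KL penalty is recovered by $f(t)=t\log t$.

Below is a minimal PyTorch sketch of the standard (KL-based) DPO loss, computed from per-response log-probabilities. The function name and signature are illustrative; the paper's SquaredPO loss is not reproduced here, since its exact form is not given in this abstract.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss (KL-penalized case) from log pi(y|x) tensors.

    *_logp_w / *_logp_l: log-probabilities of the winner / loser responses
    under the trained policy (policy_*) and the frozen reference (ref_*).
    """
    # Implicit reward margin: beta times the difference of log-ratios
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Bradley-Terry negative log-likelihood of preferring the winner
    return -F.logsigmoid(margin).mean()
```

A DPO-inducing $f$ other than $t\log t$ would change how the log-ratios enter the loss; the sketch above only illustrates the baseline that the paper generalizes.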

Idan Pipano, Shoham Sabach, Kavosh Asadi, Mohammad Ghavamzadeh • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multi-turn Chat Evaluation | MT-Bench (val) | Score: 7.924 | 3 |
