
Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator

About

Existing studies on preference optimization (PO) have centered on constructing pairwise preference data following simple heuristics, such as maximizing the margin between preferred and dispreferred completions based on human (or AI) ranked scores. However, none of these heuristics has a full theoretical justification. In this work, we develop a novel PO framework that provides theoretical guidance for effectively sampling dispreferred completions. To achieve this, we formulate PO as minimizing the negative log-likelihood (NLL) of a probability model and propose to estimate its normalization constant via a sampling strategy. As we will demonstrate, the samples drawn for this estimation can act as dispreferred completions in PO. We then select contrastive divergence (CD) as the sampling strategy and propose a novel MC-PO algorithm that applies the Monte Carlo (MC) kernel from CD to sample hard negatives w.r.t. the parameterized reward model. Finally, we propose the OnMC-PO algorithm, an extension of MC-PO to the online setting. On popular alignment benchmarks, MC-PO outperforms existing SOTA baselines, and OnMC-PO leads to further improvement.
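
To make the NLL framing concrete, here is a minimal sketch in notation of our own choosing ($r_\theta$, $Z_\theta$, $y^{+}$, $y^{-}$ are illustrative symbols, not taken from the paper): the parameterized reward induces a Gibbs distribution over completions, so fitting the preferred completion by maximum likelihood splits into a reward term and an intractable log-normalizer.

$$
p_\theta(y \mid x) = \frac{\exp\big(r_\theta(x, y)\big)}{Z_\theta(x)},
\qquad
Z_\theta(x) = \sum_{y'} \exp\big(r_\theta(x, y')\big),
$$
$$
\mathcal{L}_{\mathrm{NLL}}(\theta) = -\,r_\theta(x, y^{+}) + \log Z_\theta(x),
\qquad
\nabla_\theta \log Z_\theta(x) = \mathbb{E}_{y^{-} \sim p_\theta(\cdot \mid x)}\big[\nabla_\theta\, r_\theta(x, y^{-})\big].
$$

Since the gradient of the log-normalizer is an expectation under the model itself, contrastive divergence approximates it with a few MC steps, and the resulting samples $y^{-}$ are exactly what can serve as (hard) dispreferred completions in the pairwise loss.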

Zhuotong Chen, Fang Liu, Xuan Zhu, Yanjun Qi, Mohammad Ghavamzadeh • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Optical Character Recognition | OCRBench | Score | 864 | 232
Mathematical Multimodal Reasoning | MathVerse | Accuracy | 45.6 | 221
Mathematical Multimodal Reasoning | MathVista | Accuracy | 68.3 | 218
Multimodal Math Reasoning | MathVision | Accuracy | 25.6 | 183
Multimodal Math Reasoning | WeMath | Accuracy | 34.6 | 168
Multimodal Reasoning | WeMath | Accuracy | 34.6 | 129
Chart Understanding | ChartQA | Accuracy | 86.2 | 127
Multimodal Reasoning | MathVision | Accuracy | 25.6 | 102
Visual Question Answering | SimpleVQA | Accuracy | 0.516 | 99
Multimodal Reasoning | LogicVista | Accuracy | 45.9 | 99
(Showing 10 of 32 rows)
