
MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

About

As the era of large language models (LLMs) unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a methodology for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem; building on this paradigm, MaPPO integrates prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. MaPPO introduces no additional hyperparameters and supports preference optimization in both offline and online settings. It can also be used as a plugin for DPO variants, including the widely used SimPO, IPO, and CPO, where it produces consistent improvements. Extensive empirical evaluations across different model sizes and model series on three standard benchmarks (MT-Bench, AlpacaEval 2.0, and Arena-Hard) demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
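
To make the MLE framing concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, the baseline objective that MaPPO is described as generalizing. It is not the MaPPO objective itself: the abstract does not state how the prior reward estimates enter the loss, so that term is deliberately left out. The function names, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed default, chosen only for illustration
) -> float:
    """Standard DPO negative log-likelihood for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # Bradley-Terry style margin between the chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)
    # MLE objective: maximize log sigmoid(margin), i.e. minimize its negative.
    return -log_sigmoid(margin)
```

When the policy matches the reference model, the margin is zero and the loss equals log 2. Raising the chosen response's log-probability relative to the reference lowers the loss, which is the purely binary chosen-versus-rejected signal that the abstract says MaPPO tempers with prior reward knowledge.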

Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Commonsense Reasoning | HellaSwag | Accuracy | 62.2 | 1896 |
| Instruction Following | IFEval | IFEval Accuracy | 82 | 836 |
| Instruction Following | AlpacaEval 2.0 | Win Rate | 58.68 | 722 |
| Multitask Language Understanding | MMLU | Accuracy | 72.9 | 520 |
| Instruction Following | Arena Hard | Win Rate | 89.8 | 263 |
| Multitask Language Understanding | MMLU | Accuracy | 63.5 | 263 |
| Math Word Problem Solving | GSM8K | Accuracy | 82.4 | 158 |
| Multi-turn dialogue | MT-Bench | MT-Bench Score | 8.99 | 126 |
| Multi-turn conversation | MT-Bench | Average Score | 8.66 | 107 |
| Truthfulness | TruthfulQA | Truthfulness Accuracy | 63.7 | 51 |
Showing 10 of 11 rows
