
SimPO: Simple Preference Optimization with a Reference-Free Reward

About

Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability. In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient. Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further improving the algorithm's performance. We compare SimPO to DPO and its latest variants across various state-of-the-art training setups, including both base and instruction-tuned models such as Mistral, Llama 3, and Gemma 2. We evaluate on extensive chat-based evaluation benchmarks, including AlpacaEval 2, MT-Bench, and Arena-Hard. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard. Our top-performing model, built on Gemma-2-9B-it, achieves a 72.4% length-controlled win rate on AlpacaEval 2, a 59.1% win rate on Arena-Hard, and ranks 1st on Chatbot Arena among <10B models with real user votes.
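The objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes per-token log probabilities for the winning and losing responses are already available, and the names `avg_logprob`, `simpo_loss`, `beta`, and `gamma` are hypothetical. The scaling factor `beta` is not mentioned in the abstract, and the default values here are placeholders rather than recommended settings.

```python
import math


def avg_logprob(token_logprobs):
    """Average log probability of a sequence (length-normalized)."""
    return sum(token_logprobs) / len(token_logprobs)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(chosen_token_logprobs, rejected_token_logprobs,
               beta=2.0, gamma=1.0):
    """Bradley-Terry loss with a reference-free reward and a target margin.

    beta (reward scale) and gamma (target reward margin) are illustrative
    placeholder values, not settings taken from the paper.
    """
    # Implicit reward: scaled average log probability, no reference model.
    reward_chosen = beta * avg_logprob(chosen_token_logprobs)
    reward_rejected = beta * avg_logprob(rejected_token_logprobs)
    # The winning response should beat the losing one by at least gamma.
    logit = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(logit)) is the same as softplus(-logit).
    return softplus(-logit)


if __name__ == "__main__":
    chosen = [-0.1, -0.2, -0.3]   # made-up token log probabilities
    rejected = [-1.0, -2.0]
    print(simpo_loss(chosen, rejected))
```

Because each reward is divided by the sequence length, a longer response does not receive a lower reward merely for having more tokens, and no second (reference) model has to be kept in memory during training.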

Yu Meng, Mengzhou Xia, Danqi Chen • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Commonsense Reasoning | HellaSwag | Accuracy 82.66 | 1891 |
| Object Hallucination Evaluation | POPE | Accuracy 87.81 | 1455 |
| Mathematical Reasoning | GSM8K | Accuracy 61.22 | 1362 |
| Code Generation | HumanEval | -- | 1036 |
| Language Understanding | MMLU | Accuracy 57.39 | 825 |
| Reasoning | BBH | -- | 672 |
| Instruction Following | IFEval | IFEval Accuracy 75 | 625 |
| Mathematical Reasoning | MATH500 (test) | -- | 514 |
| Instruction Following | AlpacaEval 2.0 | Win Rate 46.4 | 507 |
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score 7.2 | 447 |
Showing 10 of 117 rows
