
SGDPO: Self-Guided Direct Preference Optimization for Language Model Alignment

About

Direct Preference Optimization (DPO) is widely used to align Large Language Models (LLMs) with human values because of its flexibility. Despite its effectiveness, it has been observed that DPO's ability to generate human-preferred responses is limited and that its results are far from robust. To address these limitations, this paper proposes a novel Self-Guided Direct Preference Optimization algorithm, SGDPO, which incorporates a pilot term to steer the gradient flow during optimization, allowing fine-grained control over the updates of the chosen and rejected rewards. We provide a detailed theoretical analysis of the proposed method and elucidate its operational mechanism. Furthermore, we conduct comprehensive experiments on various models and benchmarks. The extensive experimental results demonstrate the consistency between the empirical results and our theoretical analysis and confirm the effectiveness of the proposed approach (up to a 9.19% higher score).
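For orientation, the sketch below shows the standard DPO objective in PyTorch with a hypothetical scalar knob, pilot, that rescales the rejected-side reward margin. It is only a minimal illustration of where a gradient-steering term could act: the function name, the placement of pilot, and its scalar form are assumptions on our part; SGDPO's actual pilot term is defined in the paper and is not reproduced here.

import torch
import torch.nn.functional as F

def dpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps,
                   beta=0.1, pilot=1.0):
    # Implicit rewards, as in standard DPO: beta * log(pi_theta / pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Hypothetical steering knob: rescale the rejected-side reward.
    # SGDPO's actual pilot term is defined in the paper and may differ.
    margin = chosen_rewards - pilot * rejected_rewards
    # Standard DPO loss: negative log-sigmoid of the reward margin.
    return -F.logsigmoid(margin).mean()

# Toy usage with random per-sequence log-probabilities.
lp_chosen = torch.randn(4)
lp_rejected = lp_chosen - 1.0
loss = dpo_style_loss(lp_chosen, lp_rejected,
                      lp_chosen.detach(), lp_rejected.detach())

With pilot=1.0 this reduces to the vanilla DPO loss, which is one way to see the pilot term as a fine-grained control over how the chosen and rejected rewards are updated.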

Wenqiao Zhu, Ji Liu, Lulu Wang, Jun Wu, Yulun Zhang• 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Instruction Following | IFEval | -- | -- | 625
Physical Commonsense Reasoning | PIQA | Accuracy | 81.07 | 572
Multi-turn Dialogue Evaluation | MT-Bench | Overall Score | 8.36 | 447
Mathematical Reasoning | GSM8K | EM | 61.11 | 123
Language Understanding | MMLU | MMLU Score | 70.69 | 70
LLM Alignment Evaluation | AlpacaEval 2.0 (test) | LC Win Rate | 28.22 | 51
Scientific Reasoning | ARC | Score | 86.41 | 29
Truthfulness Evaluation | TruthfulQA | Normalized Accuracy | 58.06 | 10
