Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning
About
Direct Preference Optimization (DPO) often struggles with long-chain mathematical reasoning. Existing approaches, such as Step-DPO, typically improve this by focusing on the first erroneous step in the reasoning chain. However, they overlook all other steps and rely heavily on humans or GPT-4 to identify erroneous steps. To address these issues, we propose Full-Step-DPO, a novel DPO framework tailored for mathematical reasoning. Instead of optimizing only the first erroneous step, it leverages step-wise rewards from the entire reasoning chain. This is achieved by training a self-supervised process reward model (PRM), which automatically scores each step, providing rewards without relying on external signals. Furthermore, we introduce a novel step-wise DPO loss, which dynamically updates gradients based on these step-wise rewards, endowing language models with stronger reasoning capabilities. Extensive evaluations on both in-domain and out-of-domain mathematical reasoning benchmarks across various base language models demonstrate that Full-Step-DPO achieves superior performance compared to state-of-the-art baselines.
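The abstract does not specify the exact form of the step-wise DPO loss. As a rough illustration only, the PyTorch sketch below shows one plausible way per-step policy/reference log-probability ratios could be weighted by PRM scores so that every step, not just the first erroneous one, contributes to the gradient. The function name `stepwise_dpo_loss`, the normalization scheme, and the `beta` default are assumptions for this sketch, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def stepwise_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # per-step log-probs of the chosen chain under the policy
    policy_rejected_logps: torch.Tensor,  # per-step log-probs of the rejected chain under the policy
    ref_chosen_logps: torch.Tensor,       # same, under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    chosen_step_rewards: torch.Tensor,    # PRM scores in [0, 1], one per chosen step
    rejected_step_rewards: torch.Tensor,  # PRM scores in [0, 1], one per rejected step
    beta: float = 0.1,
) -> torch.Tensor:
    # Per-step log-ratios between policy and reference, as in vanilla DPO,
    # but kept per step instead of summed over the whole sequence.
    chosen_ratios = policy_chosen_logps - ref_chosen_logps
    rejected_ratios = policy_rejected_logps - ref_rejected_logps

    # Assumed weighting: up-weight high-reward steps in the chosen chain and
    # low-reward (likely erroneous) steps in the rejected chain, so each step
    # contributes to the gradient in proportion to its PRM score.
    w_chosen = chosen_step_rewards / chosen_step_rewards.sum()
    w_rejected = (1.0 - rejected_step_rewards) / (1.0 - rejected_step_rewards).sum()

    margin = (w_chosen * chosen_ratios).sum() - (w_rejected * rejected_ratios).sum()
    return -F.logsigmoid(beta * margin)

# Toy usage: a 4-step chosen chain and a 3-step rejected chain.
loss = stepwise_dpo_loss(
    policy_chosen_logps=torch.tensor([-1.2, -0.8, -1.0, -0.5], requires_grad=True),
    policy_rejected_logps=torch.tensor([-1.1, -2.3, -1.9], requires_grad=True),
    ref_chosen_logps=torch.tensor([-1.4, -0.9, -1.1, -0.7]),
    ref_rejected_logps=torch.tensor([-1.0, -1.8, -1.6]),
    chosen_step_rewards=torch.tensor([0.9, 0.8, 0.7, 0.95]),
    rejected_step_rewards=torch.tensor([0.7, 0.2, 0.1]),
)
loss.backward()  # gradients flow only through the policy log-probs
```

In practice the per-step log-probabilities would be obtained by summing token log-probs within each delimited reasoning step; the normalization above is just one way to keep the reward weighting on a comparable scale across chains of different lengths.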
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 89.3 | 983 |
| Mathematical Reasoning | MATH | Accuracy | 55.4 | 643 |
| Mathematical Reasoning | SVAMP | Accuracy | 89.5 | 368 |
| Mathematical Reasoning | ASDIV | Accuracy | 0.924 | 221 |
| Mathematical Reasoning | GK 2023 | Accuracy | 33.5 | 52 |
| Mathematical Reasoning | ADDSUB | Solve Rate | 93.1 | 22 |
| Mathematical Reasoning | GSM-ICM | Accuracy | 92.7 | 16 |
| Mathematical Reasoning | OCW | Accuracy | 20.2 | 16 |
| Mathematical Reasoning | GSM-IC2 | Accuracy | 93.6 | 16 |