Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning

About

Direct Preference Optimization (DPO) often struggles with long-chain mathematical reasoning. Existing approaches, such as Step-DPO, typically improve this by focusing on the first erroneous step in the reasoning chain. However, they overlook all other steps and rely heavily on humans or GPT-4 to identify erroneous steps. To address these issues, we propose Full-Step-DPO, a novel DPO framework tailored for mathematical reasoning. Instead of optimizing only the first erroneous step, it leverages step-wise rewards from the entire reasoning chain. This is achieved by training a self-supervised process reward model that automatically scores each step, providing rewards without relying on external signals. Furthermore, we introduce a novel step-wise DPO loss that dynamically updates gradients based on these step-wise rewards, endowing language models with stronger reasoning capabilities. Extensive evaluations on both in-domain and out-of-domain mathematical reasoning benchmarks, across various base language models, demonstrate that Full-Step-DPO achieves superior performance compared to state-of-the-art baselines.
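To make the step-wise weighting concrete, below is a minimal PyTorch sketch of how per-step process-reward-model (PRM) scores could modulate a DPO-style objective. The stepwise_dpo_loss function, its reward-normalized weighting scheme, and the toy numbers are all illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def stepwise_dpo_loss(chosen_logratios, rejected_logratios,
                      chosen_rewards, rejected_rewards, beta=0.1):
    """Illustrative step-wise DPO loss (not the paper's exact form).

    Args:
        chosen_logratios:   (S_w,) per-step log pi_theta - log pi_ref
                            for the preferred reasoning chain.
        rejected_logratios: (S_l,) the same for the dispreferred chain.
        chosen_rewards:     (S_w,) per-step PRM scores in [0, 1].
        rejected_rewards:   (S_l,) per-step PRM scores in [0, 1].
        beta: DPO temperature.
    """
    # Emphasize high-reward (likely correct) steps in the chosen chain...
    w_chosen = chosen_rewards / chosen_rewards.sum().clamp_min(1e-8)
    # ...and low-reward (likely erroneous) steps in the rejected chain,
    # so every step contributes to the gradient, weighted by its score.
    w_rejected = 1.0 - rejected_rewards
    w_rejected = w_rejected / w_rejected.sum().clamp_min(1e-8)

    chosen_term = (w_chosen * chosen_logratios).sum()
    rejected_term = (w_rejected * rejected_logratios).sum()

    # Standard DPO objective applied to the reward-weighted log-ratios.
    return -F.logsigmoid(beta * (chosen_term - rejected_term))

# Toy usage with made-up numbers: three steps per chain, with the
# rejected chain going wrong around its second step.
chosen_lr = torch.tensor([0.2, 0.5, 0.3])
rejected_lr = torch.tensor([0.1, -0.4, -0.2])
chosen_r = torch.tensor([0.9, 0.8, 0.95])
rejected_r = torch.tensor([0.9, 0.2, 0.1])
print(stepwise_dpo_loss(chosen_lr, rejected_lr, chosen_r, rejected_r))
```

The key design point this sketch captures is that, unlike first-error-only methods, every step's log-ratio enters the loss, with its influence scaled by the PRM's judgment of that step.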

Huimin Xu, Xin Mao, Feng-Lin Li, Xiaobao Wu, Wang Chen, Wei Zhang, Anh Tuan Luu • 2025

Related benchmarks

Task                    Dataset   Metric      Result (%)   Rank
Mathematical Reasoning  GSM8K     Accuracy    89.3         983
Mathematical Reasoning  MATH      Accuracy    55.4         643
Mathematical Reasoning  SVAMP     Accuracy    89.5         368
Mathematical Reasoning  ASDIV     Accuracy    92.4         221
Mathematical Reasoning  GK 2023   Accuracy    33.5         52
Mathematical Reasoning  ADDSUB    Solve Rate  93.1         22
Mathematical Reasoning  GSM-ICM   Accuracy    92.7         16
Mathematical Reasoning  OCW       Accuracy    20.2         16
Mathematical Reasoning  GSM-IC2   Accuracy    93.6         16
