
Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models

About

Preference-based alignment is pivotal for training large reasoning models; however, standard methods such as Direct Preference Optimization (DPO) typically treat all preference pairs uniformly, overlooking the evolving utility of training instances. This static approach often leads to inefficient or unstable optimization: it wastes computation on trivial pairs with negligible gradients and suffers from noise induced by samples near uncertain decision boundaries. To address these challenges, we propose SAGE (Stability-Aware Gradient Efficiency), a dynamic framework designed to enhance alignment reliability by maximizing the signal-to-noise ratio of policy updates. Concretely, SAGE integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence with a fine-grained, stability-aware scoring function that prioritizes informative, confident errors while filtering out unstable samples. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines, highlighting the critical role of policy-aware, stability-conscious data selection in reasoning alignment.
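The abstract does not spell out the scoring function, but a minimal sketch can illustrate one plausible reading of the fine-grained step: score each DPO pair by its expected gradient signal (large for confident errors, near zero for trivial pairs) minus an instability penalty, here taken as the variance of the pair's implicit-reward margin across recent checkpoints. All names, the `margin_history` bookkeeping, and the linear signal-minus-noise trade-off are assumptions for illustration, not the paper's actual method.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class PreferencePair:
    # DPO implicit-reward margin under the current policy:
    #   beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
    #           - (log pi(y_l|x) - log pi_ref(y_l|x))]
    margin: float
    # Margins recorded at a few recent checkpoints (hypothetical bookkeeping
    # used here as a cheap instability proxy).
    margin_history: List[float]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def stability_aware_score(pair: PreferencePair, lam: float = 1.0) -> float:
    """Higher is better: large expected gradient, low margin variance."""
    # For the DPO logistic loss -log sigmoid(margin), the gradient magnitude
    # w.r.t. the margin is sigmoid(-margin): near zero for trivially
    # satisfied pairs, close to one for confident errors.
    signal = sigmoid(-pair.margin)
    # Sample variance of the margin across checkpoints as a noise proxy;
    # pairs oscillating around the decision boundary score poorly.
    n = len(pair.margin_history)
    mean = sum(pair.margin_history) / n
    noise = sum((m - mean) ** 2 for m in pair.margin_history) / n
    return signal - lam * noise

def select_pairs(pool: List[PreferencePair], k: int) -> List[PreferencePair]:
    """Fine-grained step: keep the top-k pairs by stability-aware score."""
    return sorted(pool, key=stability_aware_score, reverse=True)[:k]
```

Using the DPO gradient weight as the signal term means selection favors exactly the pairs that would move the policy most, while the variance penalty filters the boundary-adjacent pairs the abstract identifies as a noise source.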

Hui Wu, Hengyi Cai, Jinman Zhao, Xinran Chen, Ziheng Li, Zhejun Zhao, Shuaiqiang Wang, Yuchen Li, Dawei Yin • 2026

Related benchmarks

Task                    Dataset    Metric    Result  Rank
Mathematical Reasoning  AMC 23     Accuracy  70      198
Mathematical Reasoning  Minerva    --        --      138
Mathematical Reasoning  MATH 500   Accuracy  82.8    106
Mathematical Reasoning  Olympiad   Accuracy  45.5    92
Mathematical Reasoning  AIME 24    Accuracy  33.3    84
Mathematical Reasoning  Gaokao     Accuracy  71.4    51
Mathematical Reasoning  College    Accuracy  45.14   30
