Mining or Synthesis? Rethinking Exploration Efficiency in Iterative Alignment of Mathematical Reasoning

About

Iterative Direct Preference Optimization (DPO) has emerged as a widely used paradigm for aligning Large Language Models on reasoning tasks. Existing approaches typically rely on Best-of-N sampling ($N\geq8$) to mine positive trajectories from the distribution tail. In this work, we show that in mathematical reasoning, increasing $N$ yields diminishing returns while increasing verifier-induced false-positive risk and the distribution shift required for policy updates. To address this, we introduce PACE (Proximal Alignment via Corrective Exploration), a generation-based corrective framework that replaces exhaustive mining with low-budget exploration ($2\leq N\leq3$). Rather than searching for increasingly rare positive samples, PACE synthesizes high-fidelity preference pairs from failed explorations through corrective hindsight refinement and verification-guided filtering. Empirically, PACE matches or exceeds the performance of DPO-R1 ($N=16$) while using about $1/5$ of the compute, and remains robust under 20\% label corruption, where high-$N$ baselines exhibit substantially higher noise exploitation.
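The trade-off between sampling budget and verifier noise can be illustrated with a toy calculation. This is a minimal sketch under assumptions that do not come from the paper: samples are treated as independent, each is correct with a fixed probability `p`, and the verifier wrongly accepts an incorrect sample with a fixed probability `eps`. The function names and the example values are hypothetical.

```python
def hit_rate(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def false_positive_risk(p: float, eps: float, n: int) -> float:
    """Probability that at least one incorrect sample passes the verifier.

    A single sample is a false positive with probability (1 - p) * eps:
    it must be incorrect and still be accepted.
    """
    return 1.0 - (1.0 - (1.0 - p) * eps) ** n


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    p, eps = 0.2, 0.02
    for n in (2, 3, 8, 16):
        print(f"N={n:2d}  hit_rate={hit_rate(p, n):.3f}  "
              f"false_positive_risk={false_positive_risk(p, eps, n):.3f}")
```

In this simplified model the gain in hit rate from each extra sample is `p * (1 - p) ** n`, which shrinks geometrically, while the chance of admitting at least one false positive keeps rising with `N`. It says nothing about the distribution-shift argument or about PACE's corrective refinement step, which the abstract does not specify in enough detail to sketch.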

Jun Rao, Zixiong Yu, Xuebo Liu, Guhan Chen, Jing Li, Hejin Wang, Jiansheng Wei, Xiaojun Meng, Min Zhang • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	AMC 23	Accuracy 71.6	198
Mathematical Reasoning	Minerva	--	138
Mathematical Reasoning	Olympiad	Accuracy 50.8	137
Mathematical Reasoning	College	Accuracy 47.4	67
Mathematical Reasoning	MATH500	Accuracy 84.4	18
Out-of-domain Generalization	OOD Suite (BBH, HumanEval, MMLU, TruthfulQA)	BBH Score 59.1	4
