
Mining or Synthesis? Rethinking Exploration Efficiency in Iterative Alignment of Mathematical Reasoning

About

Iterative Direct Preference Optimization (DPO) has emerged as a widely used paradigm for aligning Large Language Models on reasoning tasks. Existing approaches typically rely on Best-of-N sampling ($N\geq8$) to mine positive trajectories from the distribution tail. In this work, we show that in mathematical reasoning, increasing $N$ yields diminishing returns while increasing verifier-induced false-positive risk and the distribution shift required for policy updates. To address this, we introduce PACE (Proximal Alignment via Corrective Exploration), a generation-based corrective framework that replaces exhaustive mining with low-budget exploration ($2\leq N\leq3$). Rather than searching for increasingly rare positive samples, PACE synthesizes high-fidelity preference pairs from failed explorations through corrective hindsight refinement and verification-guided filtering. Empirically, PACE matches or exceeds the performance of DPO-R1 ($N=16$) while using about $1/5$ of the compute, and remains robust under 20\% label corruption, where high-$N$ baselines exhibit substantially higher noise exploitation.
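The diminishing-returns argument for large $N$ can be illustrated with a small back-of-the-envelope calculation. The sketch below is not the paper's analysis: it assumes each sample is independently correct with probability `p`, and that an imperfect verifier accepts an incorrect sample with probability `q`. Both parameters and both function names are hypothetical and chosen only for illustration.

```python
def hit_rate(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def false_positive_risk(p: float, q: float, n: int) -> float:
    """Probability that at least one of n samples is incorrect yet accepted.

    Assumes each sample is incorrect with probability (1 - p) and that the
    verifier wrongly accepts an incorrect sample with probability q.
    """
    return 1.0 - (1.0 - (1.0 - p) * q) ** n


# Illustrative values only (assumptions, not numbers from the paper).
p, q = 0.3, 0.05
for n in (2, 3, 8, 16):
    print(f"N={n:2d}  hit_rate={hit_rate(p, n):.3f}  "
          f"false_positive_risk={false_positive_risk(p, q, n):.3f}")
```

Under these assumptions, each additional sample adds less to the hit rate than the previous one, while the chance of admitting at least one false positive keeps growing with $N$. This matches the qualitative trade-off the abstract describes.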

Jun Rao, Zixiong Yu, Xuebo Liu, Guhan Chen, Jing Li, Hejin Wang, Jiansheng Wei, Xiaojun Meng, Min Zhang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | AMC 23 | Accuracy 71.6 | 198 |
| Mathematical Reasoning | Minerva | -- | 138 |
| Mathematical Reasoning | Olympiad | Accuracy 50.8 | 137 |
| Mathematical Reasoning | College | Accuracy 47.4 | 67 |
| Mathematical Reasoning | MATH500 | Accuracy 84.4 | 18 |
| Out-of-domain Generalization | OOD Suite (BBH, HumanEval, MMLU, TruthfulQA) | BBH Score 59.1 | 4 |
