
Unlocking Recursive Thinking of LLMs: Alignment via Refinement

About

The OpenAI o1-series models have demonstrated that leveraging long-form Chain of Thought (CoT) can substantially enhance performance. However, the recursive thinking capabilities of Large Language Models (LLMs) remain limited, particularly in the absence of expert-curated data for distillation. In this paper, we propose AvR: Alignment via Refinement, a novel method aimed at unlocking the potential of LLMs for recursive reasoning through long-form CoT. AvR introduces a refinement process that integrates criticism and improvement actions, guided by differentiable learning techniques to optimize refinement-aware rewards. As a result, the synthesized multi-round data can be organized as a long refinement thought, further enabling test-time scaling. Experimental results show that AvR significantly outperforms conventional preference optimization methods. Notably, with only 3k synthetic samples, our method boosts the performance of the LLaMA-3-8B-Instruct model by over 20% in win rate on AlpacaEval 2.0. Our code is available on GitHub (https://github.com/Banner-Z/AvR.git).
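The refinement process described above, alternating criticism and improvement actions and collecting the multi-round trajectory into one long refinement thought, can be sketched as a simple control loop. This is an illustrative sketch, not the paper's implementation: the `generate`, `criticize`, and `improve` functions are hypothetical stand-ins for LLM calls, stubbed here so the control flow is self-contained and runnable.

```python
def generate(prompt):
    # Stub for an LLM call producing an initial draft answer.
    return f"draft answer to: {prompt}"

def criticize(prompt, answer):
    # Stub for an LLM call critiquing the current answer.
    return f"critique of: {answer}"

def improve(prompt, answer, critique):
    # Stub for an LLM call revising the answer given the critique.
    return f"revised({answer})"

def refine(prompt, rounds=3):
    """Run `rounds` criticize-improve steps and return the final answer
    together with the full trajectory, which can be serialized as one
    long refinement-style chain of thought."""
    answer = generate(prompt)
    trajectory = [("generate", answer)]
    for _ in range(rounds):
        critique = criticize(prompt, answer)
        answer = improve(prompt, answer, critique)
        trajectory.append(("criticize", critique))
        trajectory.append(("improve", answer))
    return answer, trajectory

final, traj = refine("What is 2+2?", rounds=2)
```

In the paper's setting, such trajectories are the synthesized multi-round data; selecting among them with a refinement-aware reward (rather than the uniform looping shown here) is what the differentiable learning component optimizes.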

Haoke Zhang, Xiaobo Liang, Cunxiang Wang, Juntao Li, Min Zhang • 2025

Related benchmarks

Task                  | Dataset         | Metric      | Result | Rank
Instruction Following | AlpacaEval 2.0  | LC Win Rate | 51.4   | 281
Instruction Following | Arena Hard v0.1 | Score       | 34.5   | 16

Other info

Code
