
Unlocking Recursive Thinking of LLMs: Alignment via Refinement

About

The OpenAI o1-series models have demonstrated that leveraging long-form Chain of Thought (CoT) can substantially enhance performance. However, the recursive thinking capabilities of Large Language Models (LLMs) remain limited, particularly in the absence of expert-curated data for distillation. In this paper, we propose AvR: Alignment via Refinement, a novel method aimed at unlocking the potential of LLMs for recursive reasoning through long-form CoT. AvR introduces a refinement process that integrates criticism and improvement actions, guided by differentiable learning techniques to optimize refinement-aware rewards. As a result, the synthesized multi-round data can be organized as a long refinement thought, further enabling test-time scaling. Experimental results show that AvR significantly outperforms conventional preference optimization methods. Notably, with only 3k synthetic samples, our method boosts the performance of the LLaMA-3-8B-Instruct model by over 20% in win rate on AlpacaEval 2.0. Our code is available on GitHub (https://github.com/Banner-Z/AvR.git).
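The refinement process described above, alternating criticism and improvement actions and collecting the multi-round trajectory into one long refinement thought, can be sketched as a simple control loop. This is an illustrative sketch, not the paper's implementation: the `generate`, `criticize`, and `improve` functions are hypothetical stand-ins for LLM calls, stubbed here so the control flow is self-contained and runnable.

```python
def generate(prompt):
    # Stub for an LLM call producing an initial draft answer.
    return f"draft answer to: {prompt}"

def criticize(prompt, answer):
    # Stub for an LLM call critiquing the current answer.
    return f"critique of: {answer}"

def improve(prompt, answer, critique):
    # Stub for an LLM call revising the answer given the critique.
    return f"revised({answer})"

def refine(prompt, rounds=3):
    """Run `rounds` criticize-improve steps and return the final answer
    together with the full trajectory, which can be serialized as one
    long refinement-style chain of thought."""
    answer = generate(prompt)
    trajectory = [("generate", answer)]
    for _ in range(rounds):
        critique = criticize(prompt, answer)
        answer = improve(prompt, answer, critique)
        trajectory.append(("criticize", critique))
        trajectory.append(("improve", answer))
    return answer, trajectory

final, traj = refine("What is 2+2?", rounds=2)
```

In the paper's setting, such trajectories are the synthesized multi-round data; selecting among them with a refinement-aware reward (rather than the uniform looping shown here) is what the differentiable learning component optimizes.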

Haoke Zhang, Xiaobo Liang, Cunxiang Wang, Juntao Li, Min Zhang • 2025

Related benchmarks

Task                  | Dataset         | Metric      | Result | Rank
Instruction Following | AlpacaEval 2.0  | LC Win Rate | 51.4   | 281
Instruction Following | Arena Hard v0.1 | Score       | 34.5   | 16

Other info

Code
