Rectifying LLM Thought from Lens of Optimization

About

Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
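As a rough illustration of the optimization view described above, the Python sketch below scores a trajectory of surrogate-objective values, one value per reasoning step. Everything in it is an assumption made for illustration: the abstract does not give the formulas, so the function names `intensity_score`, `stability_score`, and `process_reward`, the mixing weight `alpha`, and the specific definitions (normalized net decrease, fraction of non-increasing steps) are hypothetical stand-ins rather than RePro's actual definitions.

```python
from typing import Sequence


def intensity_score(values: Sequence[float]) -> float:
    """Hypothetical intensity: net decrease relative to the starting value, clipped to [0, 1]."""
    if len(values) < 2 or values[0] <= 0:
        return 0.0
    drop = (values[0] - values[-1]) / values[0]
    return max(0.0, min(1.0, drop))


def stability_score(values: Sequence[float]) -> float:
    """Hypothetical stability: fraction of steps that do not increase the objective."""
    if len(values) < 2:
        return 0.0
    steps = list(zip(values, values[1:]))
    non_increasing = sum(1 for prev, cur in steps if cur <= prev)
    return non_increasing / len(steps)


def process_reward(values: Sequence[float], alpha: float = 0.5) -> float:
    """Assumed composite reward: a weighted average of the two scores."""
    return alpha * intensity_score(values) + (1.0 - alpha) * stability_score(values)


# Two made-up trajectories with the same start and end values.
steady = [4.0, 3.0, 2.0, 1.0]          # decreases at every step
wandering = [4.0, 5.0, 3.0, 4.5, 1.0]  # backtracks twice before converging

print(process_reward(steady))     # higher reward
print(process_reward(wandering))  # lower reward
```

Under these assumed definitions both trajectories make the same overall progress, so their intensity is equal, but the wandering one is penalized on stability. That mirrors the stated goal of discouraging overthinking and protracted chains.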

Junnan Liu, Hongwei Liu, Songyang Zhang, Kai Chen • 2025

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Mathematical Reasoning	MATH 500	MATH 500 Accuracy	83	106
Code Reasoning	LiveCodeBench	Accuracy	17.3	90
Code Reasoning	MBPP	MBPP Execution Accuracy	72.1	33
Mathematical Reasoning	LMB (LiveMathBench)	Accuracy	35.8	23
Mathematical Reasoning	AIME 2025	Accuracy (avg@16)	39	22
Mathematical Reasoning	AIME 25	avg@16 Accuracy	0.685	20
Code Generation	MBPP	Avg@4 (%)	77.5	17
Code Generation	LiveCodeBench (LCB)	% Avg@4	32.4	17
Mathematical Reasoning	MATH500	Avg@4 (%)	94.1	17
Science Reasoning	GPQA Diamond	Avg@4 Accuracy	44.8	17

Showing 10 of 17 rows
