Rectifying LLM Thought from Lens of Optimization

About

Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
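The abstract describes the process-level reward only at a high level, so the following is a minimal sketch of the idea, not the paper's actual formulation. It assumes that each reasoning step has already been assigned a value of some surrogate objective (lower is better), and it uses illustrative stand-ins for the two scores: "intensity" as the net relative decrease of that objective, and "stability" as the fraction of steps that do not increase it. The function names, the `weight` and `alpha` parameters, and the linear aggregation are all hypothetical.

```python
from typing import Sequence


def intensity_score(objective: Sequence[float]) -> float:
    """Illustrative intensity: net relative decrease of the objective.

    Assumption: objective[i] is a surrogate objective value after
    reasoning step i, where lower means closer to a solution.
    """
    if not objective:
        raise ValueError("objective trajectory must not be empty")
    start, end = objective[0], objective[-1]
    if start == 0:
        return 0.0
    return (start - end) / abs(start)


def stability_score(objective: Sequence[float]) -> float:
    """Illustrative stability: fraction of steps that do not increase the objective."""
    if not objective:
        raise ValueError("objective trajectory must not be empty")
    transitions = list(zip(objective, objective[1:]))
    if not transitions:
        return 1.0  # a single step has no oscillation to penalize
    non_increasing = sum(1 for prev, cur in transitions if cur <= prev)
    return non_increasing / len(transitions)


def process_reward(objective: Sequence[float], weight: float = 0.5) -> float:
    """Hypothetical composite reward: weighted sum of the two scores."""
    return (weight * intensity_score(objective)
            + (1.0 - weight) * stability_score(objective))


def total_reward(outcome_reward: float,
                 objective: Sequence[float],
                 alpha: float = 0.1) -> float:
    """Add the process-level term to a verifiable outcome reward (RLVR-style)."""
    return outcome_reward + alpha * process_reward(objective)


# A trajectory that improves overall but backtracks once at the third step.
trajectory = [4.0, 3.0, 3.5, 1.0]
print(intensity_score(trajectory))
print(stability_score(trajectory))
print(total_reward(1.0, trajectory))
```

Under these assumed definitions, a chain that wanders (many steps that raise the objective) scores lower on stability even when it eventually reaches the answer, which is one way a reward of this kind could discourage overthinking.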

Junnan Liu, Hongwei Liu, Songyang Zhang, Kai Chen • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | MATH 500 | MATH 500 Accuracy: 83 | 106 |
| Code Reasoning | LiveCodeBench | Accuracy: 17.3 | 46 |
| Code Reasoning | MBPP | MBPP Execution Accuracy: 72.1 | 33 |
| Mathematical Reasoning | LMB (LiveMathBench) | Accuracy: 35.8 | 23 |
| Mathematical Reasoning | AIME 2025 | Accuracy (avg@16): 39 | 22 |
| Code Generation | MBPP | Avg@4 (%): 77.5 | 17 |
| Code Generation | LiveCodeBench (LCB) | % Avg@4: 32.4 | 17 |
| Mathematical Reasoning | MATH500 | Avg@4 (%): 94.1 | 17 |
| Science Reasoning | GPQA Diamond | Avg@4 Accuracy: 44.8 | 17 |
| Mathematical Reasoning | AIME 25 | avg@16 Accuracy: 0.685 | 12 |
Showing 10 of 17 rows
