
Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning

About

The reasoning capabilities of large language models (LLMs) have improved substantially through increased test-time computation, typically in the form of intermediate tokens known as chain-of-thought (CoT). However, CoT often becomes unnecessarily long, increasing computation cost without improving accuracy and sometimes even degrading performance, a phenomenon known as "overthinking". We propose a multi-stage efficient reasoning method that combines supervised fine-tuning (via rejection sampling or reasoning trace reformatting) with reinforcement learning using an adaptive length penalty. We introduce a lightweight reward function that penalizes tokens generated after the first correct answer while encouraging self-verification only when beneficial. We conduct a holistic evaluation across seven diverse reasoning tasks, analyzing the trade-off between accuracy and response length. Our approach reduces response length by an average of 28% for 8B models and 40% for 32B models, while incurring only minor performance drops of 1.6 and 2.5 points, respectively. Despite its conceptual simplicity, it achieves a superior trade-off compared to more complex state-of-the-art efficient reasoning methods, scoring 76.6 in terms of the area under the Overthinking-Adjusted Accuracy curve ($\text{AUC}_{\text{OAA}}$), 5 points above the base model and 2.5 points above the second-best approach.
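The abstract describes the reward only at a high level, so the following is a minimal sketch of one way a "penalize tokens after the first correct answer" term could look. It is not the authors' implementation. The function names, the `candidates` format (token position paired with an intermediate answer), exact string matching, and the `alpha` weight are all assumptions made for illustration. The self-verification component is not modeled.

```python
from typing import List, Optional, Tuple


def first_correct_position(
    candidates: List[Tuple[int, str]], gold: str
) -> Optional[int]:
    """Return the token index at which the gold answer first appears, if any.

    `candidates` is a hypothetical list of (token_position, answer) pairs
    extracted from the reasoning trace.
    """
    for position, answer in sorted(candidates):
        if answer.strip() == gold.strip():
            return position
    return None


def length_penalized_reward(
    num_tokens: int,
    candidates: List[Tuple[int, str]],
    final_answer: str,
    gold: str,
    alpha: float = 0.5,  # assumed penalty weight, not taken from the paper
) -> float:
    """Correctness reward minus a penalty on tokens after the first correct answer."""
    correct = 1.0 if final_answer.strip() == gold.strip() else 0.0
    first = first_correct_position(candidates, gold)
    if first is None or num_tokens <= 0:
        # The correct answer never appeared, so there is nothing to penalize.
        return correct
    # Fraction of the response produced after the answer was already found.
    wasted = max(num_tokens - first, 0) / num_tokens
    return correct - alpha * wasted
```

Under these assumptions, a 100-token response that first reaches the correct answer at token 40 loses `alpha * 0.6` from its correctness reward, while a response that stops as soon as it reaches the answer is not penalized.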

Nathanaël Carraz Rakotonirina, Ren Pang, Neha Anna John, Michael Bohlke-Schneider, Momchil Hardalov • 2026

Related benchmarks

Task | Dataset | Result | Rank
Long-context Reasoning | LongBench v2 | -- | 48
Reasoning | Reasoning Benchmarks: average of MATH-500, AIME 24, AIME 25, GPQA Diamond, CommonsenseQA, LiveCodeBench, and LongBench v2 (Qwen3) | Accuracy 72.2 | 12
Commonsense Reasoning | CommonsenseQA | AUC_OAA 81.4 | 11
Mathematical Reasoning | AIME 24 | AUC_OAA 81.8 | 11
Mathematical Reasoning | AIME 25 | AUC_OAA 80 | 11
Science Reasoning | GPQA Diamond | AUC_OAA 71.6 | 11
Mathematical Reasoning | MATH 500 | AUC_OAA 91.2 | 11
Code Generation | LiveCodeBench | AUC_OAA 93.6 | 11
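This page does not define the AUC_OAA values reported above, so the sketch below shows only one plausible reading. It assumes that Overthinking-Adjusted Accuracy at a token threshold counts a response as correct only if it is both correct and no longer than that threshold, and that the area under the curve is the mean of that accuracy over evenly spaced thresholds, scaled to 0-100. The function names, the `max_tokens` cap, and the `step` parameter are hypothetical.

```python
from typing import Sequence


def oaa_at(correct: Sequence[bool], lengths: Sequence[int], threshold: int) -> float:
    """Fraction of responses that are correct AND use at most `threshold` tokens."""
    hits = sum(1 for c, n in zip(correct, lengths) if c and n <= threshold)
    return hits / len(correct)


def auc_oaa(
    correct: Sequence[bool],
    lengths: Sequence[int],
    max_tokens: int,
    step: int = 1,
) -> float:
    """Mean OAA over thresholds 0, step, 2*step, ..., up to max_tokens, scaled to 0-100."""
    thresholds = range(0, max_tokens + 1, step)
    total = sum(oaa_at(correct, lengths, t) for t in thresholds)
    return 100.0 * total / len(thresholds)
```

With this definition, a model that answers correctly using fewer tokens scores higher than one that reaches the same accuracy with longer responses, which matches the accuracy versus length trade-off the abstract evaluates.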