Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning

About

The reasoning capabilities of large language models (LLMs) have improved substantially through increased test-time computation, typically in the form of intermediate tokens known as chain-of-thought (CoT). However, CoT often becomes unnecessarily long, increasing computation cost without improving accuracy and sometimes even degrading performance, a phenomenon known as "overthinking". We propose a multi-stage efficient reasoning method that combines supervised fine-tuning (via rejection sampling or reasoning trace reformatting) with reinforcement learning using an adaptive length penalty. We introduce a lightweight reward function that penalizes tokens generated after the first correct answer while encouraging self-verification only when beneficial. We conduct a holistic evaluation across seven diverse reasoning tasks, analyzing the trade-off between accuracy and response length. Our approach reduces response length by an average of 28% for 8B models and 40% for 32B models, while incurring only minor performance drops of 1.6 and 2.5 points, respectively. Despite its conceptual simplicity, it achieves a better trade-off than more complex state-of-the-art efficient reasoning methods, scoring 76.6 in terms of the area under the Overthinking-Adjusted Accuracy curve ($\text{AUC}_{\text{OAA}}$), 5 points above the base model and 2.5 points above the second-best approach.
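To make the idea of penalizing tokens generated after the first correct answer concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' reward: the function name `length_penalized_reward`, the weight `alpha`, and the linear form of the penalty are all hypothetical, and the sketch does not model the "self-verification only when beneficial" component, which the abstract does not specify.

```python
from typing import Optional


def length_penalized_reward(
    is_correct: bool,
    total_tokens: int,
    first_correct_token: Optional[int],
    alpha: float = 0.5,  # hypothetical penalty weight, not from the paper
) -> float:
    """Illustrative reward: 1 for a correct answer, minus a penalty that
    grows with the fraction of tokens produced after the first correct answer.

    first_correct_token is the (assumed known) token index at which the
    correct answer first appears in the response.
    """
    # Incorrect responses, or responses where no correct answer was located,
    # receive no reward in this sketch.
    if not is_correct or first_correct_token is None or total_tokens <= 0:
        return 0.0

    # Tokens emitted after the first correct answer count as "overthinking".
    extra_tokens = max(total_tokens - first_correct_token, 0)
    overthinking_ratio = extra_tokens / total_tokens

    return 1.0 - alpha * overthinking_ratio


# A response that stops right at the first correct answer keeps full reward;
# one that spends half of its tokens afterwards is penalized.
print(length_penalized_reward(True, 100, 100))
print(length_penalized_reward(True, 100, 50))
```

Under these assumptions the reward stays in [1 - alpha, 1] for correct responses, so correctness always dominates brevity; any real implementation would need the paper's actual definition.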

Nathanaël Carraz Rakotonirina, Ren Pang, Neha Anna John, Michael Bohlke-Schneider, Momchil Hardalov • 2026

Related benchmarks

Task	Dataset	Result
Long-context Reasoning	LongBench v2	--	88
Reasoning	Reasoning Benchmarks Average of MATH-500, AIME 24, AIME 25, GPQA Diamond, CommonsenseQA, LiveCodeBench, and LongBench v2 Qwen3	Accuracy 72.2	12
Commonsense Reasoning	CommonsenseQA	AUC_OAA 81.4	11
Mathematical Reasoning	AIME 24	AUC_OAA 81.8	11
Mathematical Reasoning	AIME 25	AUC_OAA 80	11
Science Reasoning	GPQA Diamond	AUC_OAA 71.6	11
Mathematical Reasoning	MATH-500	AUC_OAA 91.2	11
Code Generation	LiveCodeBench	AUC_OAA 93.6	11

