Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning
About
The reasoning capabilities of large language models (LLMs) have improved substantially through increased test-time computation, typically in the form of intermediate tokens known as chain-of-thought (CoT). However, CoT often becomes unnecessarily long, increasing computation cost without improving accuracy and sometimes even degrading performance, a phenomenon known as "overthinking". We propose a multi-stage efficient reasoning method that combines supervised fine-tuning (via rejection sampling or reasoning trace reformatting) with reinforcement learning using an adaptive length penalty. We introduce a lightweight reward function that penalizes tokens generated after the first correct answer while encouraging self-verification only when it is beneficial. We conduct a holistic evaluation across seven diverse reasoning tasks, analyzing the trade-off between accuracy and response length. Our approach reduces response length by an average of 28% for 8B models and 40% for 32B models, while incurring only minor performance drops of 1.6 and 2.5 points, respectively. Despite its conceptual simplicity, it achieves a superior trade-off compared to more complex state-of-the-art efficient reasoning methods, scoring 76.6 in terms of the area under the Overthinking-Adjusted Accuracy curve ($\text{AUC}_{\text{OAA}}$), 5 points above the base model and 2.5 points above the second-best approach.
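To make the idea of penalizing tokens generated after the first correct answer concrete, here is a minimal sketch. It is not the paper's reward: the abstract does not give the formula, so the function name `post_answer_penalty_reward`, the penalty weight `alpha`, and the linear, length-normalized penalty are all assumptions made for illustration. The sketch also leaves out the self-verification term entirely.

```python
from typing import Optional


def post_answer_penalty_reward(
    num_tokens: int,
    first_correct_pos: Optional[int],
    final_correct: bool,
    alpha: float = 0.5,
) -> float:
    """Hypothetical length-aware reward (illustrative only).

    num_tokens:        total number of tokens in the response.
    first_correct_pos: number of tokens generated up to and including the
                       first correct answer, or None if the response never
                       states a correct answer.
    final_correct:     whether the final answer is correct.
    alpha:             assumed penalty weight (not taken from the paper).
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")

    # Base correctness reward: 1 for a correct final answer, 0 otherwise.
    base = 1.0 if final_correct else 0.0

    # No correct answer ever appeared, so no tokens count as "after" it.
    if first_correct_pos is None:
        return base

    if not 0 <= first_correct_pos <= num_tokens:
        raise ValueError("first_correct_pos must lie in [0, num_tokens]")

    # Fraction of the response spent after the first correct answer.
    wasted_fraction = (num_tokens - first_correct_pos) / num_tokens
    return base - alpha * wasted_fraction
```

Under these assumptions, a correct response that stops right at its first correct answer keeps the full reward of 1.0, while one that spends half of its tokens after that point receives 0.75 with `alpha = 0.5`.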
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-context Reasoning | LongBench v2 | -- | -- | 48 |
| Reasoning | Reasoning Benchmarks Average of MATH-500, AIME 24, AIME 25, GPQA Diamond, CommonsenseQA, LiveCodeBench, and LongBench v2 Qwen3 | Accuracy | 72.2 | 12 |
| Commonsense Reasoning | CommonsenseQA | $\text{AUC}_{\text{OAA}}$ | 81.4 | 11 |
| Mathematical Reasoning | AIME 24 | $\text{AUC}_{\text{OAA}}$ | 81.8 | 11 |
| Mathematical Reasoning | AIME 25 | $\text{AUC}_{\text{OAA}}$ | 80 | 11 |
| Science Reasoning | GPQA Diamond | $\text{AUC}_{\text{OAA}}$ | 71.6 | 11 |
| Mathematical Reasoning | MATH-500 | $\text{AUC}_{\text{OAA}}$ | 91.2 | 11 |
| Code Generation | LiveCodeBench | $\text{AUC}_{\text{OAA}}$ | 93.6 | 11 |