
Demystifying Long Chain-of-Thought Reasoning in LLMs

About

Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.
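Finding (2) says that reward shaping is needed to stabilize the growth of CoT length during RL. As an illustration only, one common shaping scheme interpolates the reward between a short-generation value and a long-generation value with a cosine schedule, so that correct answers are rewarded slightly more when concise while incorrect answers are penalized less when the model reasons longer. The function below is a minimal sketch under those assumptions; all parameter names and default values are hypothetical, not taken from the paper (the linked repository contains the authors' actual implementation).

```python
import math

def cosine_length_reward(is_correct: bool, gen_len: int, max_len: int,
                         r_c0: float = 2.0,    # reward for a correct answer at length 0
                         r_cL: float = 1.0,    # reward for a correct answer at max_len
                         r_w0: float = -10.0,  # penalty for a wrong answer at length 0
                         r_wL: float = 0.0,    # penalty for a wrong answer at max_len
                         r_exceed: float = -10.0) -> float:
    """Hypothetical cosine-interpolated reward over generation length.

    Correct answers earn slightly less as the CoT grows; wrong answers
    are penalized less as the CoT grows, nudging the model to keep
    reasoning when it has not yet found the answer.
    """
    if gen_len >= max_len:
        # Generations that hit the length cap are penalized outright.
        return r_exceed
    lo, hi = (r_cL, r_c0) if is_correct else (r_wL, r_w0)
    t = gen_len / max_len  # fraction of the length budget used
    # Cosine interpolation: equals `hi` at t = 0, decays to `lo` at t = 1.
    return lo + 0.5 * (hi - lo) * (1.0 + math.cos(math.pi * t))
```

With these defaults, a short correct answer scores near 2.0 and a long one near 1.0, while a short wrong answer scores near -10.0 and a long wrong one near 0.0 — the asymmetry is what discourages premature, confident wrong answers.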

Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Question Answering | MedQA-USMLE (test) | Accuracy | 76.6 | 101 |
| First-Order Logic Reasoning | LogicNLI | Pass@1 | 76.6 | 18 |
| Logical reasoning | LogiQA | Pass@1 Accuracy | 0.88 | 18 |
| First-Order Logic Reasoning | FOLIO | Pass@1 Success Rate | 83.4 | 18 |
| Inductive Reasoning | CLUTRR | Pass@1 | 83.5 | 18 |
| Deductive Reasoning | PrOntoQA | Pass@1 | 0.956 | 18 |
| Deductive Reasoning | ProofWriter | Pass@1 | 89.2 | 18 |
| Logical reasoning | Logical Deduction | Pass@1 | 0.876 | 18 |
| Open Question Answering | AIME 2025 (test) | Accuracy | 66.67 | 9 |
| Closed Question Answering | JAMA (test) | Accuracy | 55.4 | 9 |

Showing 10 of 13 rows.
