
The Evolution of Thought: Tracking LLM Overthinking via Reasoning Dynamics Analysis

About

Test-time scaling via explicit reasoning trajectories significantly boosts large language model (LLM) performance but often triggers overthinking. To explore this, we analyze reasoning through two lenses: Reasoning Length Dynamics, which reveals a compensatory trade-off between thinking and answer content length that eventually leads to thinking redundancy, and Reasoning Semantic Dynamics, which identifies semantic convergence and repetitive oscillations. These dynamics uncover an instance-specific Reasoning Completion Point (RCP), beyond which computation continues without further performance gain. Since the RCP varies across instances, we propose a Reasoning Completion Point Detector (RCPD), an inference-time early-exit method that identifies the RCP by monitoring the rank dynamics of termination tokens (e.g., </think>). Across AIME and GPQA benchmarks using Qwen3 and DeepSeek-R1, RCPD reduces token usage by up to 44% while preserving accuracy, offering a principled approach to efficient test-time scaling.
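The core detection idea can be illustrated with a minimal sketch: track the rank of a termination token (e.g. </think>) in the model's next-token distribution at each generation step, and exit once it has remained highly ranked for several consecutive steps. The threshold-plus-patience criterion, the function name `detect_rcp`, and the parameters `rank_threshold` and `patience` below are illustrative assumptions, not the paper's exact detection rule.

```python
def detect_rcp(term_token_ranks, rank_threshold=5, patience=3):
    """Illustrative RCP detector (assumed criterion, not the paper's exact rule).

    term_token_ranks: per-step rank of the termination token (e.g. </think>)
    in the model's next-token distribution; rank 0 is the most likely token.
    Returns the first step at which the rank has stayed at or below
    rank_threshold for patience consecutive steps, or None if never.
    """
    streak = 0
    for step, rank in enumerate(term_token_ranks):
        # Extend the streak while the termination token stays highly ranked;
        # reset it whenever the rank falls back out of the top region.
        streak = streak + 1 if rank <= rank_threshold else 0
        if streak >= patience:
            return step  # assumed Reasoning Completion Point
    return None
```

For example, a trace where the termination token's rank drops from thousands into the top 5 and stays there would trigger an early exit at the third consecutive low-rank step, truncating the remaining redundant thinking tokens.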

Zihao Wei, Liang Pang, Jiahao Liu, Wenjie Shi, Jingcheng Deng, Shicheng Xu, Zenghao Duan, Fei Sun, Huawei Shen, Xueqi Cheng • 2025

Related benchmarks

Task                    Dataset        Result          Rank
General Reasoning       Overall        Accuracy 81.5   40
Scientific Reasoning    GPQA Diamond   Accuracy 65.2   27
Mathematical Reasoning  GSM8K          Accuracy 96.7   27
Mathematical Reasoning  MATH 500       Accuracy 91.9   26
Mathematical Reasoning  AIME 24/25     Accuracy 76     25
