
The Evolution of Thought: Tracking LLM Overthinking via Reasoning Dynamics Analysis

About

Test-time scaling via explicit reasoning trajectories significantly boosts large language model (LLM) performance but often triggers overthinking. To explore this, we analyze reasoning through two lenses: Reasoning Length Dynamics, which reveals a compensatory trade-off between thinking and answer content length that eventually leads to thinking redundancy, and Reasoning Semantic Dynamics, which identifies semantic convergence and repetitive oscillations. These dynamics uncover an instance-specific Reasoning Completion Point (RCP), beyond which computation continues without further performance gain. Since the RCP varies across instances, we propose a Reasoning Completion Point Detector (RCPD), an inference-time early-exit method that identifies the RCP by monitoring the rank dynamics of termination tokens (e.g., </think>). Across AIME and GPQA benchmarks using Qwen3 and DeepSeek-R1, RCPD reduces token usage by up to 44% while preserving accuracy, offering a principled approach to efficient test-time scaling.
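The core detection idea can be illustrated with a minimal sketch: track the rank of a termination token (e.g. </think>) in the model's next-token distribution at each generation step, and exit once it has remained highly ranked for several consecutive steps. The threshold-plus-patience criterion, the function name `detect_rcp`, and the parameters `rank_threshold` and `patience` below are illustrative assumptions, not the paper's exact detection rule.

```python
def detect_rcp(term_token_ranks, rank_threshold=5, patience=3):
    """Illustrative RCP detector (assumed criterion, not the paper's exact rule).

    term_token_ranks: per-step rank of the termination token (e.g. </think>)
    in the model's next-token distribution; rank 0 is the most likely token.
    Returns the first step at which the rank has stayed at or below
    rank_threshold for patience consecutive steps, or None if never.
    """
    streak = 0
    for step, rank in enumerate(term_token_ranks):
        # Extend the streak while the termination token stays highly ranked;
        # reset it whenever the rank falls back out of the top region.
        streak = streak + 1 if rank <= rank_threshold else 0
        if streak >= patience:
            return step  # assumed Reasoning Completion Point
    return None
```

For example, a trace where the termination token's rank drops from thousands into the top 5 and stays there would trigger an early exit at the third consecutive low-rank step, truncating the remaining redundant thinking tokens.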

Zihao Wei, Liang Pang, Jiahao Liu, Wenjie Shi, Jingcheng Deng, Shicheng Xu, Zenghao Duan, Fei Sun, Huawei Shen, Xueqi Cheng • 2025

Related benchmarks

Task                    Dataset        Result          Rank
General Reasoning       Overall        Accuracy 81.5   40
Scientific Reasoning    GPQA Diamond   Accuracy 65.2   27
Mathematical Reasoning  GSM8K          Accuracy 96.7   27
Mathematical Reasoning  MATH 500       Accuracy 91.9   26
Mathematical Reasoning  AIME 24/25     Accuracy 76     25
