Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

About

The remarkable performance of models like OpenAI o1 can be attributed to their ability to emulate extended, human-like thinking during inference. These models employ long chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: how can computational resources be scaled intelligently and efficiently at test time? This paper presents the first comprehensive study of the prevalent issue of overthinking in these models, where excessive computational resources are allocated to simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by o1-like models. Using a self-training paradigm, we propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy. Experimental results show that our approach reduces computational overhead while preserving model performance across test sets of varying difficulty, such as GSM8K, MATH500, GPQA, and AIME.
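To make the idea of an outcome-oriented efficiency metric concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's definition: the abstract does not spell out the formula, so the metric below (the fraction of generated tokens spent up to and including the first correct solution, averaged over responses) and all names such as `Response` and `outcome_efficiency` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    # Hypothetical structure: one model response split into successive
    # solution attempts ("rounds"), each with a token count and a
    # correctness flag.
    solution_tokens: List[int]
    solution_correct: List[bool]


def tokens_to_first_correct(r: Response) -> Optional[int]:
    """Tokens generated up to and including the first correct round."""
    running = 0
    for n_tokens, ok in zip(r.solution_tokens, r.solution_correct):
        running += n_tokens
        if ok:
            return running
    return None  # no round reached the correct answer


def outcome_efficiency(responses: List[Response]) -> float:
    """Assumed metric: mean share of tokens that were needed to reach the
    first correct answer. Responses that never reach a correct answer
    contribute 0."""
    if not responses:
        return 0.0
    ratios = []
    for r in responses:
        total = sum(r.solution_tokens)
        first = tokens_to_first_correct(r)
        ratios.append(0.0 if first is None or total == 0 else first / total)
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    sample = [
        Response([10, 30], [True, True]),    # correct early, then keeps going
        Response([20, 20], [False, True]),   # every token was needed
        Response([5], [False]),              # never correct
    ]
    print(outcome_efficiency(sample))
```

Under this assumed definition, a low score on easy problems signals overthinking: the model reached a correct answer early and then spent most of its tokens on further rounds that did not change the outcome.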

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu • 2024

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH500 (test)	Accuracy 91.4	922
Mathematical Reasoning	MATH 500	Accuracy (Acc) 77.6	600
Mathematical Reasoning	AIME 2024	Accuracy 56.4	394
Mathematical Reasoning	AIME 24	Accuracy 34.7	358
Mathematical Reasoning	AIME 2024 (test)	Accuracy 73.3	294
Mathematical Reasoning	AMC	Accuracy (ACC) 72.8	224
Mathematical Reasoning	GSM8K	Accuracy 94.7	166
Mathematical Reasoning	Olympiad	Accuracy 0.473	136
Mathematical Reasoning	AMC 2023	Accuracy 83	104
Mathematical Reasoning	MATH 500	Average Tokens 2.91e+3	104

Showing 10 of 27 rows
