How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach

About

Chain-of-thought prompting has emerged as a powerful technique for enabling large language models (LLMs) to solve complex reasoning tasks. However, these reasoning chains can be verbose, raising concerns about efficiency. In response, recent works have sought to decrease response lengths through simple prompting strategies (e.g., 'be concise'). In this work, we conduct the first systematic study of the relationship between reasoning length and model performance across a diverse range of compression instructions (e.g., 'use 10 words or less' or 'remove all punctuation'). In doing so, we discover a universal tradeoff between reasoning length and accuracy that persists across even very distinct reasoning chains. We demonstrate that this tradeoff emerges from a sharp threshold behavior at the question level: each task has an intrinsic 'token complexity', a minimal number of tokens required for successful problem-solving. We show how token complexity enables us to compute information-theoretic limits on the accuracy-compression tradeoff, and find that prompt-based compression strategies operate far from these theoretical limits. This suggests there may be significant room for improvement, and our framework provides a benchmark to help researchers evaluate progress in reasoning efficiency. Our work also highlights the importance of adaptive compression (giving shorter responses for easier questions), and we show that token complexity is a useful tool for measuring this capability.
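The threshold behavior described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the estimator (pick the length threshold that disagrees least with the observed outcomes), the function names `estimate_token_complexity` and `threshold_accuracy`, and the toy data are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# One observation for a question: (response length in tokens, was the answer correct?)
Sample = Tuple[int, bool]


def estimate_token_complexity(samples: List[Sample]) -> Optional[int]:
    """Fit the rule 'correct iff length >= t' to one question's observations.

    Returns the threshold t with the fewest disagreements, or None if the
    best fit is 'never solved' (an infinite threshold). Illustrative only.
    """
    # Baseline: predict every response wrong; the errors are the correct ones.
    best_t: Optional[int] = None
    best_err = sum(ok for _, ok in samples)
    # Try each observed length as a candidate threshold, smallest first.
    for t in sorted({n for n, _ in samples}):
        err = sum((n >= t) != ok for n, ok in samples)
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def threshold_accuracy(lengths: List[int], complexities: List[Optional[int]]) -> float:
    """Accuracy predicted by the threshold model for one response per question."""
    hits = sum(tau is not None and n >= tau for n, tau in zip(lengths, complexities))
    return hits / len(lengths)


# Toy data (hypothetical): three questions, each answered under several prompts.
observations = [
    [(10, False), (20, True), (30, True)],  # solved once the chain reaches 20 tokens
    [(5, False), (8, False)],               # never solved
    [(4, True), (9, True)],                 # solved even by the shortest chain
]
taus = [estimate_token_complexity(obs) for obs in observations]
predicted = threshold_accuracy([25, 50, 3], taus)
```

Under this model, a compression strategy's accuracy depends only on how many of its responses clear each question's threshold, which is why spending fewer tokens on easy questions and more on hard ones (adaptive compression) matters.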

Ayeong Lee, Ethan Che, Tianyi Peng • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH500 (test)	--	922
Mathematical Reasoning	AMC23 (test)	Pass@1 87.8	68
Mathematical Reasoning	GSM8K	Tokens Used 456.9	33
Mathematical Reasoning	MATH500	Tokens 2.14e+3	33
Mathematical Reasoning	GSM8K	Tokens 692	30
Mathematical Reasoning	AIME 2024	ACC 0.00e+0	26
Mathematical Reasoning	AMC 2023	Tokens 1.33e+3	17
Mathematical Reasoning	MATH 500	Tokens Used 1.27e+3	17
Mathematical Reasoning	MATH500	Accuracy -6.4	14
Mathematical Reasoning	GSM8K (test)	Pass@1 95.2	12

