On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency

About

Parallel thinking improves LLM reasoning through multi-path sampling and aggregation. In standard evaluations, due to a lack of sample-specific priors, all samples share a global budget chosen to maximize dataset accuracy. However, many samples reach their best accuracy with much smaller budgets, causing low budget utilization. This contradiction between system efficacy and sample efficiency constitutes the Overscaling Curse. In this paper, we first provide a formal analysis of the overscaling curse and quantify its prevalence and severity in real-world systems. To break it, we propose Latent Budget Predictor (LanBo), which probes model latent representations to predict sample-specific optimal budgets. LanBo significantly improves budget utilization while maintaining dataset accuracy. We further integrate LanBo into the full decoding pipeline, inspiring Pre-decoding Budget Adaptation (PreAda), a paradigm that allocates budgets before decoding to preserve decoding-time parallelization. LanBo substantially improves hardware-aware efficiency in latency and memory, demonstrating both its practical value and its promise for efficient parallel decoding.
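The gap between a shared global budget and per-sample optimal budgets can be made concrete with a toy calculation. The sketch below is illustrative only: the accuracy table is made up, the function names are hypothetical, and the utilization ratio is an assumed stand-in for the paper's formal definition, which is not given in this abstract.

```python
# Toy illustration only: acc[i][j] is a hypothetical accuracy of sample i
# under budgets[j] parallel reasoning paths. These are not results from the paper.
budgets = [1, 2, 4, 8, 16]
acc = [
    [1.0, 1.0, 1.0, 1.0, 1.0],  # easy sample: already solved at budget 1
    [0.4, 0.7, 1.0, 1.0, 1.0],  # saturates at budget 4
    [0.1, 0.2, 0.4, 0.7, 0.9],  # hard sample: keeps improving up to budget 16
]


def global_budget(acc, budgets):
    """Smallest budget that maximizes mean accuracy over all samples."""
    n = len(acc)
    means = [sum(row[j] for row in acc) / n for j in range(len(budgets))]
    return budgets[means.index(max(means))]


def sample_optimal_budgets(acc, budgets):
    """For each sample, the smallest budget reaching that sample's best accuracy."""
    return [budgets[row.index(max(row))] for row in acc]


def budget_utilization(acc, budgets):
    """Assumed metric: budget actually needed divided by budget spent globally."""
    spent = global_budget(acc, budgets) * len(acc)
    needed = sum(sample_optimal_budgets(acc, budgets))
    return needed / spent


print(global_budget(acc, budgets))           # shared budget chosen for the dataset
print(sample_optimal_budgets(acc, budgets))  # what each sample actually needs
print(budget_utilization(acc, budgets))      # fraction of spent budget that was needed
```

In this toy setup the hardest sample alone pushes the shared budget to its maximum, while the other two samples saturate much earlier. A per-sample budget predictor such as the one the abstract describes would aim to close that gap before decoding starts.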

Yiming Wang, Zhuosheng Zhang, Rui Wang • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	AIME25	Accuracy: 68.28	37
General Knowledge Reasoning	MMLU-Pro	Accuracy: 75.84	36
General Science Question Answering	GPQA	Inference Latency (s): 2.7	24
Mathematical Reasoning	AMC	C_mem (Ratio): 0.1	24
Mathematical Reasoning	AIME24	C_mem Ratio: 22	24
Mathematical Reasoning	MATH500	Inference Latency (s): 1.2	24
Mathematical Reasoning	AMC	Latency (s): 1.7	24
Mathematical Reasoning	AIME24	Latency (s): 12.2	24
Mathematical Reasoning	AIME25	Inference Latency (s): 14.2	24
Multi-task Language Understanding	MMLU-Pro	Latency (s): 3.3	24

Showing 10 of 12 rows
