On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency

About

Parallel thinking improves LLM reasoning through multi-path sampling and aggregation. In standard evaluations, sample-specific priors are unavailable, so all samples share a single global budget chosen to maximize dataset accuracy. However, many samples reach their best accuracy with much smaller budgets, which leads to low budget utilization. This contradiction between system efficacy and sample efficiency constitutes the Overscaling Curse. In this paper, we first provide a formal analysis of the overscaling curse and quantify its prevalence and severity in real-world systems. To break it, we propose Latent Budget Predictor (LanBo), which probes the model's latent representations to predict sample-specific optimal budgets. LanBo significantly improves budget utilization while maintaining dataset accuracy. We further integrate LanBo into the full decoding pipeline, inspiring Pre-decoding Budget Adaptation (PreAda), a paradigm that allocates budgets before decoding to preserve decoding-time parallelization. LanBo substantially improves hardware-aware efficiency in latency and memory, demonstrating both its practical value and its promise for efficient parallel decoding.
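To make the overscaling curse concrete, here is a toy illustration built on hypothetical numbers; it is not the paper's formal analysis. It assumes per-sample accuracy curves over a set of budgets (number of parallel paths) and assumes that a sample's optimal budget is the smallest budget reaching its own best accuracy. It also assumes that budget utilization is the total optimal budget divided by the total budget spent under the global setting. The names `budgets`, `acc`, and `optimal_index` are illustrative only, and the paper's exact definitions may differ.

```python
# Toy illustration (hypothetical data, assumed definitions; not the paper's formalism).
# acc[i][k] = accuracy (in percent) of sample i when parallel thinking uses budgets[k] paths.
budgets = [1, 2, 4, 8, 16]
acc = [
    [90, 95, 100, 100, 100],   # saturates at budget 4
    [60, 70, 85, 95, 100],     # needs the largest budget
    [80, 100, 100, 100, 100],  # saturates at budget 2
    [50, 55, 55, 55, 55],      # plateaus at budget 2
]
n = len(acc)


def optimal_index(row):
    """Index of the smallest budget at which a sample reaches its own best accuracy."""
    return row.index(max(row))  # list.index returns the first (smallest-budget) match


# Global budget: one budget shared by all samples, chosen to maximize dataset accuracy.
global_k = max(range(len(budgets)), key=lambda k: sum(row[k] for row in acc))
global_budget = budgets[global_k]
global_acc = sum(row[global_k] for row in acc) / n

# Oracle sample-specific budgets: each sample gets only what it needs.
opt_budgets = [budgets[optimal_index(row)] for row in acc]
oracle_acc = sum(max(row) for row in acc) / n

# Assumed utilization metric: budget actually needed / budget actually spent.
utilization = sum(opt_budgets) / (n * global_budget)

print(global_budget, global_acc)  # 16 88.75
print(opt_budgets, oracle_acc)    # [4, 16, 2, 2] 88.75
print(utilization)                # 0.375
```

In this toy setup, the global budget of 16 maximizes dataset accuracy, yet three of the four samples reach their best accuracy at 2 or 4 paths. An oracle per-sample allocation keeps the same dataset accuracy while using only 37.5% of the budget. A budget predictor such as LanBo aims to approximate this oracle allocation before decoding.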

Yiming Wang, Zhuosheng Zhang, Rui Wang • 2026

Related benchmarks

Task                               | Dataset  | Metric                | Result | Rank
Mathematical Reasoning             | AIME25   | Accuracy              | 68.28  | 37
General Knowledge Reasoning        | MMLU-Pro | Accuracy              | 75.84  | 36
General Science Question Answering | GPQA     | Inference Latency (s) | 2.7    | 24
Mathematical Reasoning             | AMC      | C_mem (ratio)         | 0.1    | 24
Mathematical Reasoning             | AIME24   | C_mem (ratio)         | 22     | 24
Mathematical Reasoning             | MATH500  | Inference Latency (s) | 1.2    | 24
Mathematical Reasoning             | AMC      | Latency (s)           | 1.7    | 24
Mathematical Reasoning             | AIME24   | Latency (s)           | 12.2   | 24
Mathematical Reasoning             | AIME25   | Inference Latency (s) | 14.2   | 24
Multi-task Language Understanding  | MMLU-Pro | Latency (s)           | 3.3    | 24

(10 of 12 rows shown.)
