
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

About

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only for the achievable performance of LLMs, but also for the future of LLM pretraining and how one should trade off inference-time and pre-training compute. Despite its importance, little research has attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
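The best-of-N baseline that the compute-optimal strategy is compared against can be sketched as follows. Note that `generate` and `score` here are hypothetical placeholders standing in for an LLM sampler and a verifier/reward model; they are illustrative only and not the paper's actual implementation.

```python
def generate(prompt: str, i: int) -> str:
    # Placeholder sampler: in practice this would draw the i-th sample
    # from an LLM at nonzero temperature. Here it cycles deterministically
    # through a fixed candidate pool for illustration.
    candidates = ["answer A", "answer B", "answer C"]
    return candidates[i % len(candidates)]

def score(prompt: str, response: str) -> float:
    # Placeholder verifier: in practice a learned (e.g. process-based)
    # reward model assigning a scalar score to each candidate response.
    return {"answer A": 0.2, "answer B": 0.9, "answer C": 0.5}[response]

def best_of_n(prompt: str, n: int) -> str:
    # Sample n candidate responses, then return the one the verifier
    # ranks highest. Test-time compute scales linearly with n.
    candidates = [generate(prompt, i) for i in range(n)]
    return max(candidates, key=lambda r: score(prompt, r))
```

The compute-optimal strategy studied in the paper improves on this by adapting how the budget of N samples is spent (e.g. search strategy and revision depth) to the estimated difficulty of each prompt, rather than spending it uniformly.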

Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar • 2024

Related benchmarks

| Task                      | Dataset            | Metric              | Result | Rank |
|---------------------------|--------------------|---------------------|--------|------|
| Code Generation           | HumanEval          | Pass@1              | 34.1   | 850  |
| Mathematical Reasoning    | GSM8K (test)       | Accuracy            | 66     | 797  |
| Instruction Following     | AlpacaEval 2.0     | LC Win Rate         | 3.95   | 281  |
| Function Calling          | BFCL Multi-Turn v3 | Overall Accuracy    | 27     | 41   |
| Agentic Performance       | ACEBench Agent     | End-to-End Accuracy | 55     | 15   |
| Mathematical Reasoning    | MATH (test)        | Accuracy            | 51.8   | 14   |
| Reward Model Verification | HH-RLHF            | Win Rate            | 43.7   | 12   |
| Reward Maximization       | SHP                | Win Rate            | 0.453  | 12   |
| Question Answering        | TruthfulQA         | BLEU Accuracy       | 42.4   | 2    |
