Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
About
Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only for the achievable performance of LLMs, but also for the future of LLM pretraining and how one should trade off inference-time and pre-training compute. Despite its importance, little research has attempted to understand the scaling behaviors of various test-time inference methods, and current work largely reports negative results for a number of these strategies. In this work, we analyze two primary mechanisms for scaling test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute varies critically with the difficulty of the prompt. This observation motivates a "compute-optimal" scaling strategy, which allocates test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
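To make the best-of-N baseline and the idea of per-prompt compute allocation concrete, here is a minimal Python sketch. It is not the paper's implementation: the helpers `generate`, `verifier_score`, and `estimate_difficulty`, as well as the linear allocation rule, are illustrative assumptions standing in for a sampler, a verifier reward model, and a difficulty estimator (the paper bins prompts by difficulty and picks the best strategy per bin, which this simplifies).

```python
# Sketch only: best-of-N against a verifier, plus a naive adaptive wrapper
# that spends more samples on harder prompts. All helper callables are
# hypothetical placeholders, not APIs from the paper.

from typing import Callable, List


def best_of_n(prompt: str,
              n: int,
              generate: Callable[[str], str],
              verifier_score: Callable[[str, str], float]) -> str:
    """Sample n candidate responses and return the one the verifier scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verifier_score(prompt, c))


def adaptive_best_of_n(prompt: str,
                       budget: int,
                       generate: Callable[[str], str],
                       verifier_score: Callable[[str, str], float],
                       estimate_difficulty: Callable[[str], float]) -> str:
    """Allocate the sampling budget per prompt: easy prompts get few samples,
    hard prompts get many, so a fixed average budget goes where it helps most."""
    d = estimate_difficulty(prompt)   # assumed in [0, 1], e.g. 1 - predicted pass@1
    n = max(1, round(budget * d))     # crude linear allocation rule (an assumption)
    return best_of_n(prompt, n, generate, verifier_score)
```

Under this reading, "compute-optimal" means choosing how much (and what kind of) test-time compute to spend per prompt, rather than applying a uniform N everywhere.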
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | Pass@1 | 34.1 | 1036 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 66 | 900 |
| Instruction Following | AlpacaEval 2.0 | Win Rate | 2.24 | 507 |
| Mathematical Reasoning | AIME 2024 | Pass@1 Accuracy | 7 | 165 |
| Mathematical Reasoning | MATH 500 | Top-1 Accuracy | 66.21 | 112 |
| Mathematical Reasoning | Omni-MATH | Accuracy | 10.84 | 93 |
| Function Calling | BFCL Multi-Turn v3 | Overall Accuracy | 27 | 41 |
| Mathematics | AIME 2024 | AIME 2024 Score (%) | 17.93 | 31 |
| Math | AIME 2025 | Top-1 Score | 16.81 | 26 |
| Mathematical Problem Solving | AIME 2024 | -- | -- | 26 |