
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

About

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only for the achievable performance of LLMs, but also for the future of LLM pretraining and how one should trade off inference-time and pre-training compute. Despite its importance, little research has attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
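The best-of-N baseline that the compute-optimal strategy is compared against can be sketched as follows. Note that `generate` and `score` here are hypothetical placeholders standing in for an LLM sampler and a verifier/reward model; they are illustrative only and not the paper's actual implementation.

```python
def generate(prompt: str, i: int) -> str:
    # Placeholder sampler: in practice this would draw the i-th sample
    # from an LLM at nonzero temperature. Here it cycles deterministically
    # through a fixed candidate pool for illustration.
    candidates = ["answer A", "answer B", "answer C"]
    return candidates[i % len(candidates)]

def score(prompt: str, response: str) -> float:
    # Placeholder verifier: in practice a learned (e.g. process-based)
    # reward model assigning a scalar score to each candidate response.
    return {"answer A": 0.2, "answer B": 0.9, "answer C": 0.5}[response]

def best_of_n(prompt: str, n: int) -> str:
    # Sample n candidate responses, then return the one the verifier
    # ranks highest. Test-time compute scales linearly with n.
    candidates = [generate(prompt, i) for i in range(n)]
    return max(candidates, key=lambda r: score(prompt, r))
```

The compute-optimal strategy studied in the paper improves on this by adapting how the budget of N samples is spent (e.g. search strategy and revision depth) to the estimated difficulty of each prompt, rather than spending it uniformly.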

Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar • 2024

Related benchmarks

| Task                      | Dataset            | Metric              | Result | Rank |
|---------------------------|--------------------|---------------------|--------|------|
| Code Generation           | HumanEval          | Pass@1              | 34.1   | 850  |
| Mathematical Reasoning    | GSM8K (test)       | Accuracy            | 66     | 797  |
| Instruction Following     | AlpacaEval 2.0     | LC Win Rate         | 3.95   | 281  |
| Function Calling          | BFCL Multi-Turn v3 | Overall Accuracy    | 27     | 41   |
| Agentic Performance       | ACEBench Agent     | End-to-End Accuracy | 55     | 15   |
| Mathematical Reasoning    | MATH (test)        | Accuracy            | 51.8   | 14   |
| Reward Model Verification | HH-RLHF            | Win Rate            | 43.7   | 12   |
| Reward Maximization       | SHP                | Win Rate            | 0.453  | 12   |
| Question Answering        | TruthfulQA         | BLEU Accuracy       | 42.4   | 2    |
