OptPO: Optimal Rollout Allocation for Test-time Policy Optimization
About
Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it reuses the retained rollouts for on-policy updates, integrating seamlessly with algorithms such as PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy. By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation. The source code will be released upon acceptance at https://open-upon-acceptance.
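The adaptive stopping idea above can be sketched in a few lines. This is not the paper's implementation: the function names, the symmetric Dirichlet prior, and the Monte Carlo posterior estimate are illustrative assumptions standing in for the Bayesian sequential test it describes. Rollouts are drawn one at a time, and sampling halts once the estimated posterior probability that the current consensus answer is the true mode exceeds a threshold.

```python
import random
from collections import Counter

def posterior_confidence(counts, alpha=1.0, n_samples=4000):
    """Monte Carlo estimate of P(the current leading answer is the true
    mode of the answer distribution), under a symmetric Dirichlet(alpha)
    prior over the candidate answers seen so far."""
    answers = list(counts)
    lead = max(answers, key=lambda a: counts[a])
    wins = 0
    for _ in range(n_samples):
        # A Dirichlet draw is a vector of normalized Gamma draws; the
        # normalizer cancels when we only compare components.
        draws = {a: random.gammavariate(counts[a] + alpha, 1.0) for a in answers}
        if max(draws, key=draws.get) == lead:
            wins += 1
    return wins / n_samples

def adaptive_vote(sample_answer, max_rollouts=16, threshold=0.95):
    """Sample rollouts one at a time; stop as soon as the posterior
    confidence in the consensus answer exceeds the threshold."""
    counts = Counter()
    for n in range(1, max_rollouts + 1):
        counts[sample_answer()] += 1
        # Add a zero-count "unseen answer" cell so the test is defined
        # even while every rollout so far agrees.
        test = Counter(counts)
        test["<unseen>"] += 0
        if posterior_confidence(test) >= threshold:
            break
    consensus, _ = counts.most_common(1)[0]
    return consensus, n, counts
```

For a prompt where the model answers consistently, this sketch typically halts after 4-5 rollouts rather than spending the full budget of 16; the retained rollouts and the consensus label would then feed the on-policy update.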
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | -- | -- | 155 |
| Mathematical Reasoning | AMC | Mean@16 | 56.8 | 18 |
| Mathematical Reasoning | AIME 2024 | Mean Score@16 | 24 | 18 |
| Question Answering | GPQA | Mean@16 | 36.8 | 18 |
| Expert-Level Question Answering | GPQA | Mean Score@16 | 28.9 | 4 |
| Mathematical Reasoning | AIME 2024 | Mean@16 | 13.5 | 4 |
| Mathematical Reasoning | AMC | Mean@16 | 39.7 | 4 |
| Mathematical Reasoning | MATH 500 | Mean Accuracy@16 | 59 | 4 |