Provable and Practical In-Context Policy Optimization for Self-Improvement
About
We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards, without modifying its parameters. To explain this ICPO process, we theoretically show that with sufficient pretraining under a novel Fisher-weighted logit-matching objective, a single-layer linear self-attention model can provably imitate a policy-optimization algorithm for linear bandits. Building on this theory, we propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively feeds its previous responses and their self-assessed rewards back into the context to refine its answer at inference time. By selecting the responses whose self-assessed rewards have minimum entropy, ME-ICPO makes those rewards robust via majority voting. Across standard mathematical reasoning tasks, ME-ICPO attains top-tier performance while keeping inference costs affordable compared with other inference-time algorithms. Overall, ICPO provides a principled understanding of self-reflection in LLMs and yields practical benefits for test-time scaling in mathematical reasoning.
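The abstract does not spell out the Fisher-weighted logit-matching objective. One plausible reading, sketched here purely for illustration (the paper's exact definition may differ), is a squared logit-matching loss weighted by the Fisher information of the target output distribution, which for a softmax over logits takes the standard form below:

```latex
% A hypothetical form of a Fisher-weighted logit-matching objective
% (an assumption for illustration, not the paper's stated definition):
% match model logits z_\theta(x) to target logits z^*(x), weighting the
% error by the Fisher information F(x) of the target categorical
% distribution p^*(x) = \mathrm{softmax}(z^*(x)).
\mathcal{L}(\theta)
  \;=\; \mathbb{E}_{x}\!\left[
    \big(z_\theta(x) - z^{*}(x)\big)^{\!\top} F(x)\,
    \big(z_\theta(x) - z^{*}(x)\big)
  \right],
\qquad
F(x) \;=\; \mathrm{diag}\!\big(p^{*}(x)\big) - p^{*}(x)\,p^{*}(x)^{\top}.
```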
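To make the ME-ICPO loop concrete, here is a minimal sketch of one natural reading of the abstract: each round, the model generates a response, several self-assessed reward votes are sampled and majority-voted, the (response, reward) pair is appended to the context, and the response whose votes agree most (minimum entropy) is kept. The `generate` and `self_assess` callables are hypothetical wrappers around a frozen LLM, not the authors' implementation.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Shannon entropy of the empirical distribution over sampled reward votes."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())

def me_icpo(generate, self_assess, prompt, rounds=4, votes_per_response=8):
    """A minimal ME-ICPO-style loop (a sketch under assumptions, not the paper's code).

    generate(context) -> str and self_assess(context, response) -> label are
    hypothetical callables wrapping the same frozen model; self_assess samples
    a discrete reward label (e.g. 0/1) from the model.
    """
    context = prompt
    best_response, best_entropy = None, float("inf")
    for _ in range(rounds):
        response = generate(context)
        # Sample several self-assessed rewards and take the majority vote.
        votes = [self_assess(context, response) for _ in range(votes_per_response)]
        reward = Counter(votes).most_common(1)[0][0]
        h = vote_entropy(votes)
        # Minimum-entropy selection: keep the response whose self-assessments
        # agree most, which is what makes majority voting reliable here.
        if h < best_entropy:
            best_response, best_entropy = response, h
        # In-context refinement: append the (response, reward) pair; no
        # parameters are updated.
        context += (f"\nPrevious attempt:\n{response}"
                    f"\nSelf-assessed reward: {reward}\nImprove the answer.")
    return best_response
```

Entropy is used as an agreement score: a vote distribution concentrated on one label has low entropy, so low-entropy responses are exactly those whose majority-voted reward can be trusted.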
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH | Accuracy | 47.3 | 882 |
| Mathematical Reasoning | MATH L5 | Accuracy | 0.3171 | 90 |
| Mathematical Reasoning | MATH level 3 | Accuracy | 51.9 | 12 |
| Mathematical Reasoning | AMC | Mean@16 | 47.06 | 9 |
| Mathematical Reasoning | AIME 2024 | Mean@16 Accuracy | 79.17 | 6 |
| Mathematical Reasoning | AIME 2024 | Mean@16 | 30.42 | 5 |
| Mathematical Reasoning | MATH | Mean@16 | 54.71 | 5 |
| Mathematical Reasoning | HMMT | Mean@16 | 43.12 | 4 |