Provable and Practical In-Context Policy Optimization for Self-Improvement

About

We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards, without modifying its parameters. To explain the ICPO process, we show theoretically that, with sufficient pretraining under a novel Fisher-weighted logit-matching objective, a single-layer linear self-attention model can provably imitate a policy-optimization algorithm for linear bandits. Building on this theory, we propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively uses its own responses and self-assessed rewards to refine its response in context at inference time. By selecting the responses and rewards with minimum entropy, ME-ICPO makes the self-assessed rewards, which are obtained via majority voting, more robust. Across standard mathematical reasoning tasks, ME-ICPO attains competitive, top-tier performance while keeping inference costs affordable compared with other inference-time algorithms. Overall, ICPO provides a principled understanding of self-reflection in LLMs and yields practical benefits for test-time scaling in mathematical reasoning.
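The abstract names two ingredients of ME-ICPO without giving details: majority voting to obtain self-assessed rewards, and minimum-entropy selection. The Python sketch below is one plausible reading of those two ideas, not the authors' implementation. The function names, the 0/1 reward, and the use of Shannon entropy over sampled final answers are all assumptions made for illustration.

```python
import math
from collections import Counter


def majority_vote_rewards(answers):
    """Assumed self-assessed reward: 1.0 if an answer matches the majority, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return majority, [1.0 if a == majority else 0.0 for a in answers]


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over final answers."""
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())


def select_min_entropy(candidates):
    """Return the key of the candidate whose sampled answers have the lowest entropy.

    `candidates` maps a hypothetical context label to the final answers sampled
    under that context.
    """
    return min(candidates, key=lambda name: answer_entropy(candidates[name]))


if __name__ == "__main__":
    sampled = ["42", "42", "17", "42"]
    print(majority_vote_rewards(sampled))
    print(select_min_entropy({
        "ctx_a": ["1", "1", "1", "2"],  # answers mostly agree
        "ctx_b": ["1", "2", "3", "4"],  # answers all disagree
    }))
```

Lower entropy here means the sampled answers agree more strongly, which is the sense in which a majority-vote reward would be more reliable under this reading.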

Tianrun Yu, Yuxiao Yang, Zhaoyang Wang, Kaixiang Zhao, Porter Jenkins, Xuchao Zhang, Chetan Bansal, Huaxiu Yao, Weitong Zhang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | MATH | Accuracy: 47.3 | 882 |
| Mathematical Reasoning | MATH L5 | Accuracy: 0.3171 | 90 |
| Mathematical Reasoning | MATH level 3 | Accuracy: 51.9 | 12 |
| Mathematical Reasoning | AMC | Mean@16: 47.06 | 9 |
| Mathematical Reasoning | AIME 2024 | Mean@16 Accuracy: 79.17 | 6 |
| Mathematical Reasoning | AIME 2024 | Mean@16: 30.42 | 5 |
| Mathematical Reasoning | MATH | Mean@16: 54.71 | 5 |
| Mathematical Reasoning | HMMT | Mean@16: 43.12 | 4 |
| Mathematical Reasoning | AIME 2024 | Mean@16: 30.42 | 4 |
| Mathematical Reasoning | AMC | Mean@16: 47.06 | 4 |
Showing 10 of 14 rows
