Think Outside the Policy: In-Context Steered Policy Optimization

About

Existing Reinforcement Learning from Verifiable Rewards (RLVR) methods, such as Group Relative Policy Optimization (GRPO), have achieved remarkable progress in improving the reasoning capabilities of Large Reasoning Models (LRMs). However, they exhibit limited exploration due to reliance on on-policy rollouts which are confined to the current policy's distribution, resulting in narrow trajectory diversity. Recent approaches attempt to expand policy coverage by incorporating trajectories generated from stronger expert models, yet this reliance increases computational cost and such advanced models are often inaccessible. To address these issues, we propose In-Context Steered Policy Optimization (ICPO), a unified framework that leverages the inherent in-context learning capability of LRMs to provide expert guidance using existing datasets. ICPO introduces mixed-policy GRPO with implicit expert forcing, which expands exploration beyond the current policy distribution without requiring advanced LRM trajectories. To further stabilize optimization, ICPO integrates expert region reject sampling to filter unreliable off-policy trajectories and annealed expert-bonus reward shaping to balance early expert guidance with later autonomous improvement. Results demonstrate that ICPO consistently enhances RLVR performance and training stability on mathematical reasoning benchmarks, revealing a scalable and effective RLVR paradigm for LRMs. Our code is available at https://github.com/Celine-hxy/ICPO.

Hsiu-Yuan Huang, Chenming Tang, Weijie Liu, Clive Bai, Saiyong Yang, Yunfang Wu • 2025
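
As a rough illustration of two ingredients named in the abstract, the sketch below combines GRPO-style group-relative advantages with an annealed bonus for expert-guided rollouts. It is not the authors' implementation: the abstract gives no formulas, so the linear decay schedule, the `initial_bonus` value, the rule of rewarding only correct expert-guided rollouts, and all function names are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def annealed_bonus(step, total_steps, initial_bonus=0.5):
    """Hypothetical schedule: bonus decays linearly from initial_bonus to 0."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return initial_bonus * (1.0 - frac)


def shaped_rewards(rewards, is_expert_guided, step, total_steps):
    """Assumption: only correct (reward > 0) expert-guided rollouts get the bonus."""
    bonus = annealed_bonus(step, total_steps)
    return [
        r + bonus if (guided and r > 0) else r
        for r, guided in zip(rewards, is_expert_guided)
    ]


# One group of four rollouts for a single prompt; the first two were
# generated with in-context expert demonstrations (hypothetical setup).
rewards = [1.0, 0.0, 1.0, 0.0]
guided = [True, True, False, False]
early = group_relative_advantages(shaped_rewards(rewards, guided, 0, 100))
late = group_relative_advantages(shaped_rewards(rewards, guided, 100, 100))
```

Under these assumptions, the correct expert-guided rollout receives a larger advantage than the correct on-policy rollout early in training, and the two become equal once the bonus has annealed to zero, which mirrors the stated goal of balancing early expert guidance with later autonomous improvement.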

Related benchmarks

Task	Dataset	Metric	Result	
Mathematical Reasoning	AIME 2024	Accuracy	32.8	394
Mathematical Reasoning	AIME 2025	Accuracy	28.9	378
Mathematical Reasoning	Minerva	Accuracy (Acc)	45.6	146
Mathematical Reasoning	AMC 2023	Accuracy	75.9	104
Mathematical Reasoning	AIME 2025	Pass@1 Accuracy	43.7	79
Mathematical Reasoning	Minerva	Pass@1	51.5	78
Multi-task Language Understanding	MMLU-Pro	Accuracy	47.6	64
Question Answering	GPQA Diamond	Accuracy	34.3	45
Mathematical Reasoning	Olympiad	Pass@1	65.2	41
Mathematical Reasoning	MATH	Accuracy	88.4	40

Showing 10 of 13 rows
