RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization
About
Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model's generality. In contrast, on-policy reinforcement learning (RL) preserves generality but fails to effectively assimilate hard samples that exceed the model's current reasoning level. Recent off-policy RL attempts improve hard-sample utilization, yet they suffer from severe training instability due to the forced distribution shift toward off-policy knowledge. To reconcile effective off-policy knowledge absorption with the stability of on-policy RL, we propose Rephrasing Policy Optimization (RePO). In RePO, the policy model is prompted to first comprehend off-policy knowledge and then rephrase it into trajectories that conform to its own stylistic and parametric distribution. RePO dynamically replaces low-reward rollouts with these rephrased, high-quality trajectories. This strategy guides the model toward correct reasoning paths while strictly preserving on-policy training dynamics. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.
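The rollout-replacement step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Trajectory` class, the `replace_low_reward_rollouts` helper, and the fixed reward threshold are all hypothetical names and simplifications; in practice the rephrased trajectories would be generated by prompting the policy model to restate off-policy knowledge in its own style.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    # Hypothetical container: a rollout's text, its scalar reward,
    # and whether it was sampled on-policy or produced by rephrasing.
    text: str
    reward: float
    on_policy: bool = True


def replace_low_reward_rollouts(
    rollouts: List[Trajectory],
    rephrased: List[Trajectory],
    threshold: float,
) -> List[Trajectory]:
    """Substitute each rollout whose reward falls below `threshold`
    with the next available rephrased high-quality trajectory.

    Because the rephrased trajectories were written by the policy
    itself (restating off-policy knowledge in its own style), the
    resulting batch stays close to the on-policy distribution.
    """
    pool = list(rephrased)  # copy so the caller's list is untouched
    batch = []
    for rollout in rollouts:
        if rollout.reward < threshold and pool:
            batch.append(pool.pop(0))  # swap in a rephrased trajectory
        else:
            batch.append(rollout)      # keep the on-policy rollout
    return batch
```

After this substitution, the batch is fed to a standard on-policy policy-gradient update, so the optimizer never sees raw off-policy text, only the model's own rephrasings of it.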
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Pass@1 | 94.8 | 153 |
| Mathematical Reasoning | Minerva | Pass@1 | 32 | 138 |
| Mathematical Reasoning | AMC | Pass@1 | 88.6 | 112 |
| Mathematical Reasoning | AIME 2025 | Pass@1 | 10 | 96 |
| Mathematical Reasoning | AIME 2024 | Pass@1 | 13.1 | 86 |
| Mathematical Reasoning | Minerva | Pass@1 | 68.1 | 55 |
| Mathematical Reasoning | Olympiad | Pass@1 | 68.1 | 50 |
| Mathematical Reasoning | AIME25 | Pass@1 | 72.5 | 11 |
| General Knowledge | GPQA full | Pass@1 | 61.8 | 4 |
| General Capability | ARC-c OpenR1-Math Harder | Accuracy | 70.6 | 3 |