RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization
About
Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model's generality. In contrast, on-policy reinforcement learning (RL) preserves generality but fails to effectively assimilate hard samples that exceed the model's current reasoning level. Recent off-policy RL attempts improve hard-sample utilization, yet they suffer from severe training instability due to the forced distribution shift toward off-policy knowledge. To reconcile effective off-policy knowledge absorption with the stability of on-policy RL, we propose Rephrasing Policy Optimization (RePO). In RePO, the policy model is prompted to first comprehend off-policy knowledge and then rephrase it into trajectories that conform to its own stylistic and parametric distribution. RePO dynamically replaces low-reward rollouts with these rephrased, high-quality trajectories. This strategy guides the model toward correct reasoning paths while strictly preserving on-policy training dynamics. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.
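The rollout-replacement step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Trajectory` class, the `replace_low_reward_rollouts` helper, and the fixed reward threshold are all hypothetical names and simplifications; in practice the rephrased trajectories would be generated by prompting the policy model to restate off-policy knowledge in its own style.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    # Hypothetical container: a rollout's text, its scalar reward,
    # and whether it was sampled on-policy or produced by rephrasing.
    text: str
    reward: float
    on_policy: bool = True


def replace_low_reward_rollouts(
    rollouts: List[Trajectory],
    rephrased: List[Trajectory],
    threshold: float,
) -> List[Trajectory]:
    """Substitute each rollout whose reward falls below `threshold`
    with the next available rephrased high-quality trajectory.

    Because the rephrased trajectories were written by the policy
    itself (restating off-policy knowledge in its own style), the
    resulting batch stays close to the on-policy distribution.
    """
    pool = list(rephrased)  # copy so the caller's list is untouched
    batch = []
    for rollout in rollouts:
        if rollout.reward < threshold and pool:
            batch.append(pool.pop(0))  # swap in a rephrased trajectory
        else:
            batch.append(rollout)      # keep the on-policy rollout
    return batch
```

After this substitution, the batch is fed to a standard on-policy policy-gradient update, so the optimizer never sees raw off-policy text, only the model's own rephrasings of it.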
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Pass@1 | 94.8 | 153 |
| Mathematical Reasoning | Minerva | Pass@1 | 32 | 138 |
| Mathematical Reasoning | AMC | Pass@1 | 88.6 | 112 |
| Mathematical Reasoning | AIME 2025 | Pass@1 | 10 | 96 |
| Mathematical Reasoning | AIME 2024 | Pass@1 | 13.1 | 86 |
| Mathematical Reasoning | Minerva | Pass@1 | 68.1 | 55 |
| Mathematical Reasoning | Olympiad | Pass@1 | 68.1 | 50 |
| Mathematical Reasoning | AIME25 | Pass@1 | 72.5 | 11 |
| General Knowledge | GPQA full | Pass@1 | 61.8 | 4 |
| General Capability | ARC-c OpenR1-Math Harder | Accuracy | 70.6 | 3 |