
A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization

About

Reinforcement learning (RL) post-training has increasingly demonstrated strong ability to elicit reasoning behaviors in large language models (LLMs). For training efficiency, rollouts are typically generated in an off-policy manner using an older sampling policy and then used to update the current target policy. To correct the resulting discrepancy between the sampling and target policies, most existing RL objectives rely on a token-level importance sampling ratio, primarily due to its computational simplicity and numerical stability. However, we observe that token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. In this paper, we revisit LLM policy optimization under off-policy conditions and show that the theoretically rigorous correction term is the prefix importance ratio, and that relaxing it to a token-level approximation can induce instability in RL post-training. To stabilize LLM optimization under large off-policy drift, we propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO). MinPRO replaces the unstable cumulative prefix ratio with a non-cumulative surrogate based on the minimum token-level ratio observed in the preceding prefix. Extensive experiments on both dense and mixture-of-experts LLMs, across multiple mathematical reasoning benchmarks, demonstrate that MinPRO substantially improves training stability and peak performance in off-policy regimes.
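The three correction terms contrasted in the abstract can be sketched in a few lines. The exact loss in which MinPRO's surrogate appears (clipping, advantage weighting) is not given on this page, so the function names and the way the ratios are assembled here are illustrative assumptions, not the paper's implementation:

```python
import math

def token_ratios(target_logps, sampling_logps):
    """Per-token importance ratios pi_target / pi_sampling from log-probs."""
    return [math.exp(t - s) for t, s in zip(target_logps, sampling_logps)]

def prefix_ratios(ratios):
    """Cumulative prefix importance ratio: product of token ratios up to t.

    This is the theoretically rigorous correction, but the running product
    can explode or vanish when off-policy drift is large.
    """
    out, prod = [], 1.0
    for r in ratios:
        prod *= r
        out.append(prod)
    return out

def minpro_ratios(ratios):
    """MinPRO's non-cumulative surrogate (per the abstract): the minimum
    token-level ratio observed in the preceding prefix, which stays bounded
    instead of compounding multiplicatively.
    """
    out, m = [], float("inf")
    for r in ratios:
        m = min(m, r)
        out.append(m)
    return out
```

For token ratios `[2.0, 0.5, 3.0]`, the cumulative prefix ratios are `[2.0, 1.0, 3.0]`, while the MinPRO surrogate gives `[2.0, 0.5, 0.5]`: a single small ratio caps all later positions rather than being multiplied away or amplified.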

Shiye Lei, Zhihao Cheng, Dacheng Tao • 2026

Related benchmarks

Task                    | Dataset  | Metric | Result | Rank
Mathematical Reasoning  | Minerva  | Pass@1 | 33.6   | 138
Mathematical Reasoning  | GSM8K    | Pass@1 | 90.2   | 102
Mathematical Reasoning  | AIME 25  | Pass@1 | 33.4   | 65
Mathematical Reasoning  | Olympiad | Pass@1 | 58.8   | 50
Mathematical Reasoning  | AMC 23   | Pass@1 | 87.5   | 46
Mathematical Reasoning  | MATH500  | Pass@1 | 88.1   | 41
Mathematical Reasoning  | AIME 24  | Pass@1 | 47.0   | 23
