
A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization

About

Reinforcement learning (RL) post-training has increasingly demonstrated strong ability to elicit reasoning behaviors in large language models (LLMs). For training efficiency, rollouts are typically generated in an off-policy manner using an older sampling policy and then used to update the current target policy. To correct the resulting discrepancy between the sampling and target policies, most existing RL objectives rely on a token-level importance sampling ratio, primarily due to its computational simplicity and numerical stability. However, we observe that token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. In this paper, we revisit LLM policy optimization under off-policy conditions and show that the theoretically rigorous correction term is the prefix importance ratio, and that relaxing it to a token-level approximation can induce instability in RL post-training. To stabilize LLM optimization under large off-policy drift, we propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO). MinPRO replaces the unstable cumulative prefix ratio with a non-cumulative surrogate based on the minimum token-level ratio observed in the preceding prefix. Extensive experiments on both dense and mixture-of-experts LLMs, across multiple mathematical reasoning benchmarks, demonstrate that MinPRO substantially improves training stability and peak performance in off-policy regimes.
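The three correction terms contrasted in the abstract can be sketched in a few lines. The exact loss in which MinPRO's surrogate appears (clipping, advantage weighting) is not given on this page, so the function names and the way the ratios are assembled here are illustrative assumptions, not the paper's implementation:

```python
import math

def token_ratios(target_logps, sampling_logps):
    """Per-token importance ratios pi_target / pi_sampling from log-probs."""
    return [math.exp(t - s) for t, s in zip(target_logps, sampling_logps)]

def prefix_ratios(ratios):
    """Cumulative prefix importance ratio: product of token ratios up to t.

    This is the theoretically rigorous correction, but the running product
    can explode or vanish when off-policy drift is large.
    """
    out, prod = [], 1.0
    for r in ratios:
        prod *= r
        out.append(prod)
    return out

def minpro_ratios(ratios):
    """MinPRO's non-cumulative surrogate (per the abstract): the minimum
    token-level ratio observed in the preceding prefix, which stays bounded
    instead of compounding multiplicatively.
    """
    out, m = [], float("inf")
    for r in ratios:
        m = min(m, r)
        out.append(m)
    return out
```

For token ratios `[2.0, 0.5, 3.0]`, the cumulative prefix ratios are `[2.0, 1.0, 3.0]`, while the MinPRO surrogate gives `[2.0, 0.5, 0.5]`: a single small ratio caps all later positions rather than being multiplied away or amplified.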

Shiye Lei, Zhihao Cheng, Dacheng Tao • 2026

Related benchmarks

Task                    | Dataset  | Metric | Result | Rank
Mathematical Reasoning  | Minerva  | Pass@1 | 33.6   | 138
Mathematical Reasoning  | GSM8K    | Pass@1 | 90.2   | 102
Mathematical Reasoning  | AIME 25  | Pass@1 | 33.4   | 65
Mathematical Reasoning  | Olympiad | Pass@1 | 58.8   | 50
Mathematical Reasoning  | AMC 23   | Pass@1 | 87.5   | 46
Mathematical Reasoning  | MATH500  | Pass@1 | 88.1   | 41
Mathematical Reasoning  | AIME 24  | Pass@1 | 47.0   | 23
