Extreme Region Policy Distillation

About

Reinforcement learning for large language models faces a fundamental trade-off between sample efficiency and asymptotic performance: strictly on-policy methods discard trajectories after a single update, while off-policy reuse introduces distribution mismatch that existing trust-region techniques mitigate primarily by enforcing conservative optimization, often leaving rich training signals underutilized. To investigate this, we perform extensive off-policy updates on fixed data. Our experiments reveal that aggressive multi-step optimization brings rapid initial gains, but excessive updates cause trajectory probabilities to deviate and entropy to collapse, with performance plateauing early. Tightening KL constraints merely lowers the ceiling without resolving the degradation. This motivates Extreme Region Policy Distillation (ERPD), a two-stage framework that decouples sample efficiency from KL efficiency. The first stage performs weakly constrained off-policy optimization on fixed data to maximally extract training signals. The resulting policy provides token-level supervision. In the second stage, we distill these signals into the base policy under trust-region constraints, filtering harmful drift while preserving useful signals. The distilled policy achieves comparable or better performance with substantially smaller KL divergence, indicating that much of the first-stage divergence was spent on unnecessary drift rather than genuine improvement. Crucially, ERPD accommodates both strong and weak teachers: when aggressive optimization yields no stronger policy, even degenerate teachers provide effective supervision via alternative signal construction strategies. We validate ERPD on mathematical reasoning, showing gains for strong base models where on-policy training plateaus, and reliable improvements with weak teachers.
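The second stage described above (distilling the first-stage policy's token-level signal into the base policy under a trust-region constraint) can be illustrated with a toy sketch. The abstract does not specify the loss, the direction of the KL term, or how the constraint is enforced, so everything below is an assumption made for illustration: a single token position, a cross-entropy distillation step on the logits, and a hard KL budget measured as KL(base || student). The function names `softmax`, `kl`, and `distill_with_kl_budget`, and the parameters `kl_budget` and `lr`, are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_with_kl_budget(base_logits, teacher_probs, kl_budget,
                           lr=0.5, max_steps=200):
    """Move the student toward the teacher, but never leave the KL budget.

    Assumed (hypothetical) scheme: plain gradient steps on the
    cross-entropy between teacher and student; a step is rejected, and
    training stops, as soon as KL(base || student) would exceed kl_budget.
    """
    base_probs = softmax(base_logits)
    student = list(base_logits)  # the student starts at the base policy
    for _ in range(max_steps):
        probs = softmax(student)
        # Gradient of cross-entropy(teacher, student) w.r.t. the logits
        # is (student_probs - teacher_probs).
        proposal = [s - lr * (p - t)
                    for s, p, t in zip(student, probs, teacher_probs)]
        if kl(base_probs, softmax(proposal)) > kl_budget:
            break  # the next step would leave the trust region
        student = proposal
    return student


# Toy usage: a uniform base policy over four tokens and a much sharper
# teacher, standing in for an aggressively optimized first-stage policy.
base_logits = [0.0, 0.0, 0.0, 0.0]
teacher_probs = [0.7, 0.1, 0.1, 0.1]
student_logits = distill_with_kl_budget(base_logits, teacher_probs,
                                        kl_budget=0.05)
student_probs = softmax(student_logits)
print("student probs:", [round(p, 3) for p in student_probs])
print("KL(base || student):", round(kl(softmax(base_logits), student_probs), 4))
```

In this toy setting the student ends up closer to the teacher than the base policy was, while its divergence from the base stays within the chosen budget. This mirrors, in a very simplified form, the abstract's point that useful signal can be retained at a much smaller KL cost than the first-stage policy incurs.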

Changyu Chen, Xiting Wang, Rui Yan • 2026

Related benchmarks

Task                    Dataset            Metric    Result  Rank
Code Generation         LiveCodeBench v6   Accuracy  63.7    75
Mathematical Reasoning  IMO-Answer-Bench   Accuracy  86.3    32
Mathematical Reasoning  HMMT Feb 25        AVG@K     73.3    20
Mathematical Reasoning  Beyond AIME        AVG@K     63.2    20
Mathematical Reasoning  HMMT Nov 25        AVG@K     73.9    20
Mathematical Reasoning  HMMT Nov 25        Accuracy  94.1    9
Mathematical Reasoning  HMMT Feb 25        Accuracy  94.1    9
Mathematical Reasoning  HMMT Feb 26        Accuracy  89.2    9
