Extreme Region Policy Distillation

About

Reinforcement learning for large language models faces a fundamental trade-off between sample efficiency and asymptotic performance: strictly on-policy methods discard trajectories after a single update, while off-policy reuse introduces distribution mismatch that existing trust-region techniques mitigate primarily by enforcing conservative optimization, often leaving rich training signals underutilized. To investigate this, we perform extensive off-policy updates on fixed data. Our experiments reveal that aggressive multi-step optimization brings rapid initial gains, but excessive updates cause trajectory probabilities to deviate and entropy to collapse, with performance plateauing early. Tightening KL constraints merely lowers the ceiling without resolving the degradation. This motivates Extreme Region Policy Distillation (ERPD), a two-stage framework that decouples sample efficiency from KL efficiency. The first stage performs weakly constrained off-policy optimization on fixed data to maximally extract training signals. The resulting policy provides token-level supervision. In the second stage, we distill these signals into the base policy under trust-region constraints, filtering harmful drift while preserving useful signals. The distilled policy achieves comparable or better performance with substantially smaller KL divergence, indicating that much of the first-stage divergence was spent on unnecessary drift rather than genuine improvement. Crucially, ERPD accommodates both strong and weak teachers: when aggressive optimization yields no stronger policy, even degenerate teachers provide effective supervision via alternative signal construction strategies. We validate ERPD on mathematical reasoning, showing gains for strong base models where on-policy training plateaus, and reliable improvements with weak teachers.
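The abstract does not specify the second-stage objective, so the following is only a minimal sketch of one possible reading: a token-level distillation loss toward the first-stage (teacher) policy, masked so that tokens where the student has already drifted too far from the base policy contribute nothing. The function names, the forward-KL choice, the hard mask, and the threshold `delta` are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_kl(logp, logq):
    """Per-token KL(p || q) computed from log-probabilities."""
    return (np.exp(logp) * (logp - logq)).sum(axis=-1)

def distill_loss(student_logits, teacher_logits, base_logits, delta=0.05):
    """Hypothetical trust-region-masked distillation loss.

    All arrays have shape (num_tokens, vocab_size). `delta` is an assumed
    per-token KL budget relative to the base policy.
    """
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    b = log_softmax(base_logits)

    drift = token_kl(b, s)       # how far the student has moved from the base
    inside = drift <= delta      # tokens still inside the assumed trust region
    per_token = token_kl(t, s)   # token-level supervision from the teacher

    count = max(int(inside.sum()), 1)  # avoid division by zero
    loss = float((per_token * inside).sum() / count)
    return loss, inside

# Usage: at initialization the student equals the base policy, so every
# token is inside the trust region and the loss is the mean teacher KL.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 6))
teacher = base + rng.normal(scale=0.5, size=(4, 6))
loss, inside = distill_loss(base, teacher, base)
```

A hard mask is only one way to express a trust region; a clipped ratio or a KL penalty term would be equally plausible readings of the abstract.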

Changyu Chen, Xiting Wang, Rui Yan • 2026

Related benchmarks

Task	Dataset	Result
Code Generation	LiveCodeBench v6	Accuracy 63.7	91
Mathematical Reasoning	IMO-Answer-Bench	Accuracy 86.3	32
Mathematical Reasoning	HMMT Feb 25	AVG@K 73.3	20
Mathematical Reasoning	Beyond AIME	AVG@K 63.2	20
Mathematical Reasoning	HMMT Nov 25	AVG@K 73.9	20
Mathematical Reasoning	HMMT Nov 25	Accuracy 94.1	9
Mathematical Reasoning	HMMT Feb 25	Accuracy 94.1	9
Mathematical Reasoning	HMMT Feb 26	Accuracy 89.2	9
