Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

About

On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these "drifted" trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce Prune-OPD, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-k overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This allows the training process to halt futile generation and reallocate compute strictly to reliable teacher supervision. Across diverse teacher-student combinations, Prune-OPD consistently aligns computation with supervision reliability. When prefix drift makes dense teacher rewards unreliable, it reduces training time by 37.6% to 68.0% while preserving, and often improving, performance on challenging benchmarks (AMC, AIME, HMMT). When student-teacher compatibility remains high, it automatically preserves long-context supervision by expanding the training window. These results suggest that Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.
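
The drift-monitoring idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact rule, so the function names, the thresholds (`drift_threshold`, `decay`, `min_weight`), and the multiplicative decay are all assumptions chosen only to show top-k overlap monitoring, monotone down-weighting, and truncation working together.

```python
from typing import List, Sequence, Set, Tuple


def topk_indices(logits: Sequence[float], k: int) -> Set[int]:
    """Indices of the k largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def topk_overlap(student_logits: Sequence[float],
                 teacher_logits: Sequence[float],
                 k: int) -> float:
    """Fraction of the student's top-k tokens that are also in the teacher's top-k."""
    shared = topk_indices(student_logits, k) & topk_indices(teacher_logits, k)
    return len(shared) / k


def prune_weights(overlaps: Sequence[float],
                  drift_threshold: float = 0.5,  # assumed value, not from the paper
                  decay: float = 0.5,            # assumed value, not from the paper
                  min_weight: float = 0.2        # assumed value, not from the paper
                  ) -> Tuple[List[float], int]:
    """Return (per-token reward weights, truncation position).

    Hypothetical rule: every position whose overlap is below
    drift_threshold counts as a drift event and multiplies the running
    weight by decay. The weight never increases, so later rewards are
    down-weighted monotonically. Once the weight falls below
    min_weight, the rollout is truncated at that position.
    """
    weights: List[float] = []
    weight = 1.0
    for position, overlap in enumerate(overlaps):
        if overlap < drift_threshold:
            weight *= decay
        if weight < min_weight:
            return weights, position  # stop generating from here on
        weights.append(weight)
    return weights, len(overlaps)  # no truncation: full window is kept


if __name__ == "__main__":
    # Per-token overlaps for one hypothetical rollout.
    overlaps = [1.0, 0.8, 0.2, 0.9, 0.1, 0.0, 0.0, 1.0]
    weights, cut = prune_weights(overlaps)
    print(weights, cut)  # [1.0, 1.0, 0.5, 0.5, 0.25] 5
```

In this sketch a rollout with consistently high overlap is never truncated, which mirrors the abstract's claim that long-context supervision is preserved when student-teacher compatibility stays high.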

Zhicheng Yang, Zhijiang Guo, Yifan Song, Minrui Xu, Yongxin Wang, Yiwei Wang, Xiaodan Liang, Jing Tang • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	AIME 2024	Accuracy 66.7	525
Mathematical Reasoning	AIME 2025	Accuracy 52.1	353
Mathematical Reasoning	HMMT 2025	Accuracy 31.5	241
Mathematical Reasoning	AMC 2023	--	144
Mathematical Reasoning	Mathematical Reasoning Suite (AIME24, AIME25, AMC23, HMMT24, HMMT25)	AIME24 Score 66.88	28
Mathematical Reasoning	AIME 2024	Pass@1 Accuracy 45.2	16
Mathematical Reasoning	AIME 2025	Pass@1 Accuracy 33.5	16
Mathematical Reasoning	HMMT 2024	Pass@1 Accuracy 23.3	16

