
Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

About

On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these "drifted" trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce Prune-OPD, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-k overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This allows the training process to halt futile generation and reallocate compute strictly to reliable teacher supervision. Across diverse teacher-student combinations, Prune-OPD consistently aligns computation with supervision reliability. When prefix drift makes dense teacher rewards unreliable, it reduces training time by 37.6% to 68.0% while preserving, and often improving, performance on challenging benchmarks (AMC, AIME, HMMT). When student-teacher compatibility remains high, it automatically preserves long-context supervision by expanding the training window. These results suggest that Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.
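The abstract names three mechanisms: a per-token compatibility signal (top-k overlap), monotone down-weighting after drift, and rollout truncation. The sketch below is one possible reading of that description and not the authors' implementation. The function names, the rule for declaring drift (`threshold`, `patience`), the geometric `decay`, and the `min_weight` cutoff are all hypothetical choices made for illustration, since the abstract gives none of these details.

```python
from typing import List, Sequence, Tuple


def topk_overlap(student_logits: Sequence[float],
                 teacher_logits: Sequence[float],
                 k: int) -> float:
    """Fraction of the student's top-k token ids that are also in the teacher's top-k."""
    def topk(logits: Sequence[float]) -> set:
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return set(order[:k])
    return len(topk(student_logits) & topk(teacher_logits)) / k


def prune_weights(overlaps: Sequence[float],
                  threshold: float = 0.5,   # hypothetical: overlap below this counts as "low"
                  patience: int = 2,        # hypothetical: consecutive low steps that signal drift
                  decay: float = 0.5,       # hypothetical: per-step multiplicative down-weighting
                  min_weight: float = 0.1   # hypothetical: truncate once the weight falls below this
                  ) -> Tuple[List[float], int]:
    """Return per-token reward weights and the position at which the rollout is cut.

    The weight starts at 1.0 and is only ever multiplied by `decay`,
    so it is non-increasing along the sequence.
    """
    weights: List[float] = []
    weight = 1.0
    low_streak = 0
    for t, overlap in enumerate(overlaps):
        low_streak = low_streak + 1 if overlap < threshold else 0
        if low_streak >= patience:
            weight *= decay          # drift detected: shrink the reward weight
        if weight < min_weight:
            return weights, t        # stop generating and scoring from position t on
        weights.append(weight)
    return weights, len(overlaps)


# Toy overlap trace: compatible at first, then drifting.
weights, cut = prune_weights([1.0, 0.8, 0.4, 0.2, 0.2, 0.0, 0.0, 1.0])
```

In this toy trace the weights stay at 1.0 until two consecutive low-overlap steps occur, then halve at each step, and the rollout is cut at position 6. The late recovery at the last position is never reached, which illustrates why the choice of cutoff matters in a scheme like this.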

Zhicheng Yang, Zhijiang Guo, Yifan Song, Minrui Xu, Yongxin Wang, Yiwei Wang, Xiaodan Liang, Jing Tang • 2026

Related benchmarks

Task                    Dataset    Result               Rank
Mathematical Reasoning  AIME 2024  Accuracy 66.7        479
Mathematical Reasoning  AIME 2025  Accuracy 52.1        311
Mathematical Reasoning  HMMT 2025  Accuracy 31.5        194
Mathematical Reasoning  AMC 2023   --                   144
Mathematical Reasoning  AIME 2024  Pass@1 Accuracy 45.2 16
Mathematical Reasoning  AIME 2025  Pass@1 Accuracy 33.5 16
Mathematical Reasoning  HMMT 2024  Pass@1 Accuracy 23.3 16
