Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning
About
On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these "drifted" trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce Prune-OPD, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-k overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This allows the training process to halt futile generation and reallocate compute strictly to reliable teacher supervision. Across diverse teacher-student combinations, Prune-OPD consistently aligns computation with supervision reliability. When prefix drift makes dense teacher rewards unreliable, it reduces training time by 37.6% to 68.0% while preserving, and often improving, performance on challenging benchmarks (AMC, AIME, HMMT). When student-teacher compatibility remains high, it automatically preserves long-context supervision by expanding the training window. These results suggest that Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.
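The drift-monitoring idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the exact detection rule, so the function names (`topk_overlap`, `prune_weights`) and the parameters `drift_threshold`, `patience`, `decay`, and `min_weight` are all hypothetical choices made for illustration. The sketch computes a per-token top-k overlap, treats a run of low-overlap tokens as a drift event, shrinks the reward weight from that point on (so weights never increase), and truncates the rollout once the weight falls below a floor.

```python
from typing import List, Sequence, Tuple


def topk_overlap(student_logits: Sequence[float],
                 teacher_logits: Sequence[float],
                 k: int) -> float:
    """Fraction of token ids shared by the student's and teacher's top-k sets."""
    def topk(logits: Sequence[float]) -> set:
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return set(order[:k])

    return len(topk(student_logits) & topk(teacher_logits)) / k


def prune_weights(overlaps: Sequence[float],
                  drift_threshold: float = 0.5,  # assumed: overlap below this counts as "low"
                  patience: int = 2,             # assumed: consecutive low tokens that signal drift
                  decay: float = 0.5,            # assumed: multiplicative down-weighting factor
                  min_weight: float = 0.2        # assumed: floor below which the rollout is cut
                  ) -> Tuple[List[float], int]:
    """Return (per-token reward weights, truncation index).

    Weights are non-increasing along the sequence. Tokens at or after the
    truncation index are dropped, i.e. generation would stop there.
    """
    weights: List[float] = []
    weight = 1.0
    low_streak = 0
    for t, overlap in enumerate(overlaps):
        low_streak = low_streak + 1 if overlap < drift_threshold else 0
        if low_streak >= patience:
            weight *= decay          # drift detected: shrink this and all later rewards
        if weight < min_weight:
            return weights, t        # supervision too unreliable: truncate the rollout
        weights.append(weight)
    return weights, len(overlaps)    # no truncation: the full rollout is kept


# Hypothetical per-token overlaps for one rollout.
overlaps = [1.0, 0.8, 0.4, 0.2, 0.2, 0.0, 0.0, 0.9]
weights, cut = prune_weights(overlaps)
print(weights, cut)
```

In this toy rollout the weight stays at 1.0 until two consecutive low-overlap tokens appear, then halves at each further low-overlap step, and the rollout is cut before the sixth token. A rollout whose overlap stays high is returned at full length with all weights equal to 1.0, which loosely mirrors the behaviour described for compatible teacher-student pairs.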
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | AIME 2024 | Accuracy 66.7 | 479 |
| Mathematical Reasoning | AIME 2025 | Accuracy 52.1 | 311 |
| Mathematical Reasoning | HMMT 2025 | Accuracy 31.5 | 194 |
| Mathematical Reasoning | AMC 2023 | -- | 144 |
| Mathematical Reasoning | AIME 2024 | Pass@1 Accuracy 45.2 | 16 |
| Mathematical Reasoning | AIME 2025 | Pass@1 Accuracy 33.5 | 16 |
| Mathematical Reasoning | HMMT 2024 | Pass@1 Accuracy 23.3 | 16 |