Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing
About
Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement Learning (RL) frameworks. To improve efficiency, we propose Near-Policy Distillation (NPD), an asynchronous approach that decouples student generation from training. This reformulation enables Supervised Fine-Tuning (SFT) with sequence packing. However, asynchronous updates inevitably introduce policy lag and sample noise, which can cause the behavior to drift from near-policy toward off-policy. To counteract this without sacrificing efficiency, NPD integrates sparse student updates and $\Delta$-IFD filtering, a heuristic sample selection mechanism that empirically stabilizes the optimization trajectory. By filtering extreme out-of-distribution samples, $\Delta$-IFD prevents noise from dominating the gradients, keeping updates within a safe proximal learning zone. Empirically, the NPD framework achieves an 8.1x speedup over on-policy baselines and outperforms SFT by 8.09%. Crucially, by effectively narrowing the exploration space for subsequent RL, our method enables openPangu-Embedded-1B to reach a state-of-the-art score of 68.73%, outperforming the substantially larger Qwen3-1.7B. Code will be released soon.
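The abstract names two mechanisms, sample filtering and sequence packing, without giving their details. The following is a minimal sketch of how such a pipeline could be arranged, not the authors' implementation. The `Sample` type, the `select` and `pack` helpers, the score thresholds `lo` and `hi`, and the use of first-fit-decreasing packing are all assumptions made for illustration. In particular, the `score` field is only a placeholder: the abstract does not define how $\Delta$-IFD is computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    tokens: List[int]  # token ids of one student-generated sequence
    score: float       # placeholder per-sample score (NOT the paper's Delta-IFD formula)


def select(samples: List[Sample], lo: float, hi: float) -> List[Sample]:
    """Keep only samples whose score lies in [lo, hi] (hypothetical thresholds)."""
    return [s for s in samples if lo <= s.score <= hi]


def pack(samples: List[Sample], max_len: int) -> List[List[Sample]]:
    """Group samples into bins of at most max_len tokens (first-fit decreasing)."""
    bins: List[List[Sample]] = []
    loads: List[int] = []  # tokens already placed in each bin
    for s in sorted(samples, key=lambda x: len(x.tokens), reverse=True):
        n = len(s.tokens)
        if n > max_len:
            continue  # assumption: over-long samples are skipped, not truncated
        for i, load in enumerate(loads):
            if load + n <= max_len:
                bins[i].append(s)
                loads[i] += n
                break
        else:
            # no existing bin has room, so open a new one
            bins.append([s])
            loads.append(n)
    return bins


samples = [
    Sample([0] * 5, 0.5),
    Sample([0] * 3, 0.4),
    Sample([0] * 4, 9.0),  # extreme score, removed by the filter
    Sample([0] * 2, 0.6),
    Sample([0] * 6, 0.3),
]
kept = select(samples, lo=0.0, hi=1.0)
packed = pack(kept, max_len=8)
packed_lengths = [[len(s.tokens) for s in b] for b in packed]
print(packed_lengths)
```

Because generation is decoupled from training, filtering and packing of this kind can run over a stored pool of student outputs before each SFT step. A real packed batch would also need attention masks or position resets so that sequences sharing a bin do not attend to each other.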
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy (Acc) | 84.76 | 543 |
| Language Understanding | CMMLU | Accuracy | 56.53 | 62 |
| Language Understanding | CEval | Accuracy | 67.13 | 43 |
| Reading Comprehension | DROP | F1 Score | 69.18 | 25 |
| Instruction Following | IF-Eval | Prompt Strict Accuracy | 65.43 | 22 |
| Expert-Level Reasoning | GPQA Diamond | Pass@1 Score | 50.51 | 14 |
| Overall Evaluation | Aggregate | Average Score | 68.73 | 9 |
| Winograd Schema Challenge | CLUEWSC | Accuracy | 82.87 | 9 |