Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

About

Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement Learning (RL) frameworks. To improve efficiency, we propose Near-Policy Distillation (NPD), an asynchronous approach that decouples student generation from training. This reformulation enables Supervised Fine-Tuning (SFT) with sequence packing. However, asynchronous updates inevitably introduce policy lag and sample noise, which can cause the behavior to drift from near-policy toward off-policy. To counteract this without sacrificing efficiency, NPD integrates sparse student updates and $\Delta$-IFD filtering, a heuristic sample selection mechanism that empirically stabilizes the optimization trajectory. By filtering extreme out-of-distribution samples, $\Delta$-IFD prevents noise from dominating the gradients, ensuring updates remain within a safe proximal learning zone. Empirically, the NPD framework achieves an 8.1x speedup over on-policy baselines and outperforms SFT by 8.09%. Crucially, by effectively narrowing the exploration space for subsequent RL, our method enables openPangu-Embedded-1B to reach a state-of-the-art score of 68.73%, outperforming the substantially larger Qwen3-1.7B. Code will be released soon.
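
The abstract names sequence packing as the efficiency gain unlocked by moving to SFT, but does not say which packing algorithm NPD uses. As an illustration only, the following is a minimal sketch of one common choice, greedy first-fit-decreasing packing, which groups variable-length samples into bins that each fit within a fixed context length. The function name `pack_sequences` and the parameter `max_len` are hypothetical and do not come from the paper.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sample indices into bins whose total token count is <= max_len.

    Illustrative first-fit-decreasing heuristic; not the paper's method.
    """
    # Visit samples from longest to shortest so large items are placed first.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    bins: List[List[int]] = []  # sample indices assigned to each packed sequence
    free: List[int] = []        # remaining token capacity of each bin

    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} has {n} tokens, exceeding max_len={max_len}")
        # Place the sample in the first bin with enough room.
        for b, capacity in enumerate(free):
            if n <= capacity:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing bin fits, so open a new one.
            bins.append([i])
            free.append(max_len - n)
    return bins


if __name__ == "__main__":
    sample_lengths = [5, 3, 4, 2, 6]
    print(pack_sequences(sample_lengths, max_len=8))
```

In this toy run, five samples totalling 20 tokens fit into three packed sequences of at most 8 tokens each, instead of five padded ones. A real training pipeline would additionally need attention masking or position resets so that packed samples do not attend to each other; that detail is omitted here.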

Miao Rang, Zhenni Bi, Hang Zhou, Kai Han, Xuechun Wang, An Xiao, Xinghao Chen, Yunhe Wang, Hanting Chen • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy (Acc) | 84.76 | 543 |
| Language Understanding | CMMLU | Accuracy | 56.53 | 62 |
| Language Understanding | CEval | Accuracy | 67.13 | 43 |
| Reading Comprehension | DROP | F1 Score | 69.18 | 25 |
| Instruction Following | IF-Eval | Prompt Strict Accuracy | 65.43 | 22 |
| Expert-Level Reasoning | GPQA Diamond | Pass@1 Score | 50.51 | 14 |
| Overall Evaluation | Aggregate | Average Score | 68.73 | 9 |
| Winograd Schema Challenge | CLUEWSC | Accuracy | 82.87 | 9 |