
Draft-OPD: On-Policy Distillation for Speculative Draft Models

About

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models such as EAGLE-3 or DFlash is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models: they cannot reliably roll out complete sequences on their own, while target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the error positions exposed by verification. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over 5× lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%, respectively.
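The draft-then-verify loop described above can be illustrated with a small toy sketch. It is not the authors' implementation: it assumes greedy (exact-match) verification, stands in for both models with hypothetical next-token functions (`toy_target_next`, `toy_draft_next`), and omits the distillation loss entirely. It only shows how target-assisted rollout keeps the output identical to target-only decoding while recording the per-step acceptance length and the positions where verification rejects a draft token, which is the kind of error position the abstract says drafting is replayed from.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def draft_block(draft_next: NextTokenFn, prefix: List[Token], block_size: int) -> List[Token]:
    """Let the draft model propose `block_size` tokens autoregressively."""
    ctx = list(prefix)
    block = []
    for _ in range(block_size):
        tok = draft_next(ctx)
        block.append(tok)
        ctx.append(tok)
    return block


def verify_block(target_next: NextTokenFn, prefix: List[Token], block: List[Token]) -> Tuple[int, Token]:
    """Greedy verification (an assumption of this sketch).

    Accept draft tokens while they match the target's greedy choice.
    Returns (number of accepted draft tokens, the target's own next token).
    """
    ctx = list(prefix)
    for i, tok in enumerate(block):
        expected = target_next(ctx)
        if tok != expected:
            return i, expected  # first rejection: the target's token replaces it
        ctx.append(tok)
    return len(block), target_next(ctx)  # all accepted: target adds a bonus token


def speculative_decode(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    block_size: int,
) -> Tuple[List[Token], List[int], List[int]]:
    """Target-assisted rollout that also logs where the drafter was rejected."""
    seq = list(prompt)
    error_positions: List[int] = []   # absolute indices of rejected draft tokens
    accepted_lengths: List[int] = []  # tokens emitted per verification step
    while len(seq) - len(prompt) < max_new_tokens:
        block = draft_block(draft_next, seq, block_size)
        n_acc, correction = verify_block(target_next, seq, block)
        if n_acc < len(block):
            error_positions.append(len(seq) + n_acc)
        seq.extend(block[:n_acc])
        seq.append(correction)
        accepted_lengths.append(n_acc + 1)
    return seq[: len(prompt) + max_new_tokens], error_positions, accepted_lengths


# Hypothetical toy "models": the target counts upward modulo 10;
# the drafter agrees except that it errs after seeing token 4.
def toy_target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft_next(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    seq, errors, lengths = speculative_decode(toy_target_next, toy_draft_next, [2], 8, 4)
    print("sequence:", seq)
    print("rejected positions:", errors)
    print("tokens per step:", lengths)
```

In this sketch the final sequence follows the target exactly, so the sequence alone carries no on-policy signal; the drafter's own behavior survives only in the logged rejection positions and the rejected blocks themselves.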

Haodi Lei, Yafu Li, Haoran Zhang, Shunkai Zhang, Qianjia Cheng, Xiaoye Qu, Ganqu Cui, Bowen Zhou, Ning Ding, Yun Luo, Yu Cheng • 2026

Related benchmarks

Task                    Dataset         Result                          Rank
Code Generation         HumanEval       Speedup Factor: 6.02            147
Code Generation         MBPP            Speedup: 5.64                   79
Mathematical Reasoning  GSM8K           Average Length: 6.6             61
Multi-turn dialogue     MT-Bench        Speedup: 3.18                   44
Software Engineering    SWE-Bench Lite  Speedup: 4.66                   36
Mathematical Reasoning  AIME 25         Throughput (tokens/s): 9.04e+3  30
Mathematics             MATH 500        Throughput (tokens/s): 1.09e+4  30
Software Engineering    SWE Lite        Throughput (tokens/s): 1.05e+4  30
Mathematical Reasoning  MATH 500        Speedup: 7.64                   24
Mathematical Reasoning  AIME 25         Speedup: 6.99                   24
