D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting
About
Speculative decoding accelerates LLM inference by having a small drafter propose tokens that a larger target model verifies in parallel. Recent diffusion-based parallel drafters such as DFlash predict the full B-token block in one forward pass, enabling deeper drafters and longer accepted blocks. However, existing multi-token drafter objectives often use fixed position-dependent weighting schedules, such as head-dependent weights or block-position decays, which do not adapt as the positions limiting acceptance change during training. To address this, we derive per-position training weights from a differentiable surrogate of expected accepted draft length, matching the weight of each position to its log-probability gradient contribution. The resulting loss, D-PACE (Dynamic Position-Aware Cross-Entropy), shifts training signal toward positions that currently limit acceptance as the drafter improves. Across six benchmarks, two Qwen3-4B draft depths, two decoding temperatures, and two additional target models, D-PACE consistently improves both wall-clock speedup and average emitted length, with 2.3% measured training-time overhead and no changes to the drafter architecture or inference procedure.
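The weighting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that the surrogate for expected accepted draft length is the sum of prefix products of the probabilities the drafter assigns to the target's tokens, so that the derivative with respect to each position's log-probability is a suffix sum of those prefix products. The function names `expected_accepted_length`, `dynamic_position_weights`, and `weighted_cross_entropy` are hypothetical.

```python
import math


def expected_accepted_length(p):
    """Assumed surrogate: sum over k of prod_{j<=k} p[j].

    p[i] is the probability the drafter assigns to the target's token
    at block position i (an illustrative stand-in for acceptance).
    """
    total, running = 0.0, 1.0
    for p_i in p:
        running *= p_i      # probability that positions 0..i all match
        total += running
    return total


def dynamic_position_weights(p):
    """Weight of position i = d(surrogate) / d(log p[i]).

    Under the assumed surrogate this equals the sum of all prefix
    products that include position i, i.e. a suffix sum.
    """
    prefix, running = [], 1.0
    for p_i in p:
        running *= p_i
        prefix.append(running)
    weights, tail = [0.0] * len(p), 0.0
    for i in reversed(range(len(p))):
        tail += prefix[i]
        weights[i] = tail
    return weights


def weighted_cross_entropy(p, weights):
    """Position-weighted negative log-likelihood over one block.

    In a training loop the weights would be treated as constants
    (no gradient flows through them); here they are plain floats.
    """
    return -sum(w * math.log(p_i) for w, p_i in zip(weights, p))


if __name__ == "__main__":
    block = [0.9, 0.5, 0.8]  # made-up per-position probabilities
    w = dynamic_position_weights(block)
    print(w, weighted_cross_entropy(block, w))
```

Under this assumption, a position only receives a large weight when the positions before it are already likely to be accepted, so the weights move toward later positions as the drafter improves, in contrast to a fixed decay schedule.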
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Instruction Following | MT-Bench | -- | 287 |
| Instruction Following | Alpaca | -- | 173 |
| Code Generation | MBPP | -- | 79 |
| Chat | MT-Bench | -- | 73 |
| Chat | Alpaca | Success Rate (SR): 1.79 | 16 |
| Code Generation | HumanEval | SR: 3.81 | 16 |
| Code Generation | MBPP | Success Rate (SR): 3.53 | 16 |
| Mathematics | GSM8K | Solve Rate (SR): 3.91 | 16 |
| Mathematics | MATH 500 | Success Rate (SR): 4.47 | 16 |
| Code Generation | HumanEval | Success Rate (SR): 2.82 | 4 |