Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing
About
Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence. PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36$\times$ across multiple model families and benchmarks under a unified decoding protocol. These results show that performance-driven window-level optimization is a practical approach to improving speculative decoding efficiency.
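The prefix sensitivity described above can be illustrated with a minimal sketch. It assumes greedy, exact-match verification (a simplification of the usual probabilistic acceptance rule), and the function names and token IDs are hypothetical. It is not the PPOW reward or training procedure. It only shows why two windows with identical token-level accuracy can have very different speculative utility.

```python
from typing import Sequence


def token_accuracy(draft: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of positions where the draft token equals the target token."""
    matches = sum(1 for d, t in zip(draft, target) if d == t)
    return matches / len(target)


def accepted_prefix_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Number of draft tokens accepted before the first mismatch.

    Assumes greedy exact-match verification: everything after the first
    mismatch is discarded, even if later tokens happen to agree.
    """
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted


# Hypothetical token IDs the target model would emit for one window.
target = [11, 42, 7, 99, 5]

# Two drafts, each with exactly one wrong token out of five.
early_miss = [11, 0, 7, 99, 5]   # mismatch at position 1
late_miss = [11, 42, 7, 99, 0]   # mismatch at position 4

for name, draft in [("early_miss", early_miss), ("late_miss", late_miss)]:
    print(name, token_accuracy(draft, target), accepted_prefix_length(draft, target))
```

Both drafts score 0.8 on token-level accuracy, yet the early mismatch yields an accepted prefix of 1 token while the late mismatch yields 4. A token-level supervised objective treats the two cases alike, whereas a window-level objective can distinguish them.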
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | Speedup Factor | 4.87 | 147 |
| General speculative decoding performance | Mean (MT-Bench, HumanEval, GSM8K) | Average Acceptance Length (τ) | 6.52 | 112 |
| Code Generation | HumanEval | Average Acceptance Length (τ) | 7.23 | 20 |
| Mathematical Reasoning | GSM8K | Average Acceptance Length (τ) | 6.97 | 20 |
| Multi-turn dialogue | MT-Bench | Average Acceptance Length (τ) | 5.78 | 20 |
| Summarization | X-SUM | Average Acceptance Length (τ) | 5.13 | 3 |
| Machine Translation | WMT14 | Average Acceptance Length (τ) | 2.97 | 3 |