Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

About

Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence. PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36× across multiple model families and benchmarks under a unified decoding protocol. These results show that performance-driven window-level optimization is a practical approach to improving speculative decoding efficiency.
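The prefix sensitivity described above can be illustrated with a minimal sketch. It assumes simple greedy verification, where a draft token is accepted only if it equals the target model's token at the same position; the paper's actual verification rule and reward definitions are not given here, and the token ids and helper names below are hypothetical. The point is only that two drafts with identical token-level accuracy can yield very different accepted prefixes.

```python
from typing import Sequence


def accepted_prefix_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that match the target before the first mismatch."""
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # the first mismatch invalidates the rest of the window
        accepted += 1
    return accepted


def token_accuracy(draft: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of positions where the draft matches the target (token-level view)."""
    return sum(d == t for d, t in zip(draft, target)) / len(draft)


# Hypothetical token ids for one speculative window of length 6.
target_tokens = [11, 42, 7, 99, 5, 23]
late_miss = [11, 42, 7, 99, 5, 0]    # wrong only at the last position
early_miss = [11, 0, 7, 99, 5, 23]   # wrong only at the second position

for name, draft in [("late_miss", late_miss), ("early_miss", early_miss)]:
    print(
        name,
        "token accuracy:", round(token_accuracy(draft, target_tokens), 3),
        "accepted prefix:", accepted_prefix_length(draft, target_tokens),
    )
```

Both drafts get five of six tokens right, yet under this assumed rule the late miss keeps five tokens while the early miss keeps only one. A token-level objective scores the two drafts identically, which is the gap a window-level objective is meant to close.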

Jie Jiang, Xing Sun, Ruotian Chen, Jianan Su, Kaixin Shen • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Code Generation	HumanEval	Speedup Factor	4.87	147
General speculative decoding performance	Mean (MT-Bench, HumanEval, GSM8K)	Average Acceptance Length (τ)	6.52	112
Code Generation	HumanEval	Average Acceptance Length (τ)	7.23	20
Mathematical Reasoning	GSM8K	Average Acceptance Length (τ)	6.97	20
Multi-turn dialogue	MT-Bench	Acceptance Length (τ)	5.78	20
Summarization	X-SUM	Average Acceptance Length (τ)	5.13	3
Machine Translation	WMT14	Average Acceptance Length (τ)	2.97	3
