Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding
About
Speculative decoding accelerates large language model (LLM) inference by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, selecting an optimal speculation length is critical for maximizing speedup while minimizing wasted computation. We introduce *GammaTune* and *GammaTune+*, training-free adaptive algorithms that dynamically adjust speculation length based on token acceptance rates using a heuristic-based switching mechanism. Evaluated on SpecBench across multiple tasks and model pairs, our method outperforms other heuristic-based approaches and fixed-length speculative decoding, achieving an average speedup of 15% (±5%) with *GammaTune* and 16% (±3%) with *GammaTune+*, while reducing performance variance. This makes *GammaTune* a robust and efficient solution for real-world deployment.
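The abstract does not spell out the switching rule, so the sketch below is a hypothetical illustration of an acceptance-rate heuristic of this kind, not the paper's actual algorithm: the speculation length (gamma) grows when the target model accepts most drafted tokens and shrinks when acceptance drops. The class name, thresholds, and bounds are all assumptions for illustration.

```python
class AdaptiveGamma:
    """Illustrative sketch: adjust speculation length (gamma) from
    observed token acceptance. Hypothetical heuristic, not the
    published GammaTune mechanism."""

    def __init__(self, gamma=4, gamma_min=1, gamma_max=8,
                 high=0.8, low=0.4):
        self.gamma = gamma          # current speculation length
        self.gamma_min = gamma_min  # never speculate fewer tokens than this
        self.gamma_max = gamma_max  # cap on drafted tokens per step
        self.high = high            # acceptance rate above which gamma grows
        self.low = low              # acceptance rate below which gamma shrinks

    def update(self, accepted, proposed):
        """Update gamma after one verification step.

        accepted: number of drafted tokens the target model accepted
        proposed: number of tokens the draft model proposed this step
        """
        rate = accepted / proposed if proposed else 0.0
        if rate >= self.high:
            self.gamma = min(self.gamma + 1, self.gamma_max)
        elif rate <= self.low:
            self.gamma = max(self.gamma - 1, self.gamma_min)
        return self.gamma
```

A training-free controller like this adds no overhead beyond counting accepted tokens, which is why heuristic switching is attractive for deployment compared to learned schedulers.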
Related benchmarks
| Task | Dataset | Speedup (×) | Rank |
|---|---|---|---|
| Mathematical Reasoning | GSM8K | 3.32 | 246 |
| Instruction Following | Alpaca | 3.51 | 111 |
| Question Answering | QA | 2.99 | 47 |
| Multi-turn Conversation | MT-Bench | 3.43 | 25 |
| Multi-turn Conversation | MT-Bench | 4.15 | 25 |