SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding
About
Large language models incur high inference latency due to sequential autoregressive decoding. Speculative decoding alleviates this bottleneck by using a lightweight draft model to propose multiple tokens for batched verification. However, its adoption has been limited by the lack of high-quality draft models and scalable training infrastructure. We introduce SpecForge, an open-source, production-oriented framework for training speculative decoding models with full support for EAGLE-3. SpecForge incorporates target-draft decoupling, hybrid parallelism, optimized training kernels, and integration with production-grade inference engines, enabling up to 9.9x faster EAGLE-3 training for Qwen3-235B-A22B. In addition, we release SpecBundle, a suite of production-grade EAGLE-3 draft models trained with SpecForge for mainstream open-source LLMs. Through a systematic study of speculative decoding training recipes, SpecBundle addresses the scarcity of high-quality drafts in the community, and our draft models achieve up to 4.48x end-to-end inference speedup on SGLang, establishing SpecForge as a practical foundation for real-world speculative decoding deployment.
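The draft-then-verify loop described above can be illustrated with a minimal sketch. It shows the simplest greedy variant of speculative decoding, not EAGLE-3 and not SpecForge's API. The names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, and the two "models" are toy functions that map a token sequence to the next token. A real engine would verify all proposed tokens in one batched forward pass of the target model; the sketch uses a loop for clarity.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Run one greedy draft-then-verify step and return the accepted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed position.
    #    (Assumption: a real system scores all k positions in a single batched pass.)
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target's pass also yields one bonus token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Decode max_new_tokens tokens; also return the number of verification steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
        steps += 1
    return out[:len(prompt) + max_new_tokens], steps


# Toy models: the target counts upward modulo 10; the draft agrees except after a 5.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


tokens, steps = generate(toy_target, toy_draft, prompt=[0], max_new_tokens=8, k=4)
```

Because every accepted token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target alone. The gain is in the number of target passes: the better the draft model agrees with the target, the more tokens each verification step accepts, which is why draft model quality matters for end-to-end speedup.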
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Instruction Following | Alpaca | -- | 173 |
| Question Answering | QA | -- | 47 |
| Generative Inference | MT-Bench | Speedup: 2.61 | 44 |
| Summarization | CNN/DM | -- | 32 |
| Code Generation | HumanEval | TPS (Tokens/s): 254.6 | 25 |
| Multi-turn dialogue | MT-Bench | Tokens/s: 222 | 20 |
| Coding | LiveCodeBench | Throughput: 3.41e+3 | 18 |
| Coding | HumanEval | Throughput: 3.07e+3 | 18 |
| Large Language Model Inference | GPQA | Throughput: 2.34e+3 | 18 |
| Large Language Model Inference | FinanceQA | Throughput: 1.78e+3 | 18 |