Self-Distillation for Multi-Token Prediction
About
As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) can accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches face two challenges: limited acceptance rates of MTP heads, and difficulty in jointly training multiple MTP heads. We therefore propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost that boosts MTP-head acceptance rates (+7.5%) while maximally preserving main-head performance. We also introduce a looped extension strategy for MTP-D, enabling effective and economical extension of MTP heads and a further significant inference speedup over 1-head MTP (+220.4%). Moreover, we systematically explore and validate key insights on distillation strategies and the potential scalability of MTP through extensive experiments on seven benchmarks. These results demonstrate that MTP-D and the looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating the practical use of MTP in LLMs.
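To make the "acceptance rate" notion concrete, here is a minimal toy sketch of the draft-and-verify loop that MTP-style decoding relies on: extra heads draft future tokens, and a draft counts as accepted only if it matches what the main head would have produced. The heads below are stand-in lookup functions, not the paper's actual architecture or training procedure.

```python
# Toy illustration of multi-token prediction (MTP) with draft-and-verify
# acceptance, in the spirit of speculative decoding. All heads here are
# simple stand-in functions, not the MTP-D model itself.

def mtp_generate(main_head, mtp_heads, prefix, steps):
    """Greedy draft-and-verify loop.

    main_head(seq) -> next token predicted by the main head.
    mtp_heads: list of draft functions; each drafts one token further ahead.
    Returns (tokens, accepted, proposed) so an acceptance rate can be computed.
    """
    out = list(prefix)
    accepted = proposed = 0
    for _ in range(steps):
        # The main head always contributes one guaranteed-correct token.
        out.append(main_head(out))
        # MTP heads then draft the following tokens; each draft is kept
        # only while it agrees with the main head's own prediction.
        for head in mtp_heads:
            draft = head(out)
            proposed += 1
            if draft == main_head(out):
                out.append(draft)
                accepted += 1
            else:
                break  # first rejected draft ends this round
    return out, accepted, proposed


# Usage with hypothetical toy heads on a counting sequence:
main = lambda seq: seq[-1] + 1                             # "ground truth" head
perfect = lambda seq: seq[-1] + 1                          # always agrees
noisy = lambda seq: seq[-1] + (1 if len(seq) % 3 else 2)   # sometimes wrong

tokens, acc, prop = mtp_generate(main, [perfect, noisy], [0], steps=4)
print(f"acceptance rate = {acc / prop:.2f}")               # accepted / proposed
```

A higher acceptance rate means more drafted tokens survive verification per step, which is exactly the quantity the +7.5% improvement above refers to.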
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH | Speedup | 2.972 | 42 |
| Language Modeling and Reasoning | Multi-benchmark Suite (AGIEval, GSM8K, MATH, Natural Questions, SimpleQA, TriviaQA, SuperGPQA) (cumulative) | AGIEval (EN) | 90.98 | 20 |
| General Reasoning | AGIEval (EN) | Speedup Ratio | 2.132 | 15 |
| Knowledge Retrieval | Natural Questions | Speedup Ratio | 5.803 | 15 |
| Knowledge Retrieval | SimpleQA | Speedup Ratio | 3.884 | 15 |
| Knowledge Retrieval | TriviaQA | Speedup Ratio | 4.765 | 15 |
| Mathematical Reasoning | GSM8K | Speedup Ratio | 2.414 | 15 |
| STEM Reasoning | SuperGPQA | Speedup Ratio | 2.096 | 15 |