Self-Distillation for Multi-Token Prediction

About

As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) can accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads and difficulties in jointly training multiple MTP heads. We therefore propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP-head acceptance rates (+7.5%) while maximally preserving main-head performance. We also introduce a looped extension strategy for MTP-D, enabling effective and economical MTP-head extension and a further significant inference speedup over 1-head MTP (+220.4%). Moreover, we systematically explore and validate key insights on distillation strategies and the potential scalability of MTP through extensive experiments on seven benchmarks. These results demonstrate that our MTP-D and looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating the practical use of MTP in LLMs.
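The abstract describes distilling the main head's own predictions into the MTP heads, but the paper's exact loss and head architecture are not reproduced on this page. The sketch below is therefore only a minimal, hypothetical PyTorch illustration of how such self-distillation could be wired up: the `MTPHeads` module, the KL-plus-cross-entropy blend, the `alpha`/`tau` hyperparameters, and the position alignment are all assumptions for illustration, not the authors' implementation.

```python
# Minimal, hypothetical sketch (not the authors' released code): the detached
# main head's next-token distributions act as soft targets for lightweight
# MTP heads that predict further-ahead tokens in parallel.
import torch
import torch.nn.functional as F
from torch import nn


class MTPHeads(nn.Module):
    """K lightweight heads; head i at position t predicts the token (i + 1)
    steps beyond the main head's next-token prediction (assumed layout)."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, vocab_size, bias=False) for _ in range(num_heads)]
        )

    def forward(self, hidden: torch.Tensor) -> list[torch.Tensor]:
        # hidden: (batch, seq, hidden_size) -> list of (batch, seq, vocab) logits
        return [head(hidden) for head in self.heads]


def self_distillation_loss(mtp_logits, teacher_logits, next_token_labels,
                           alpha: float = 0.5, tau: float = 1.0):
    """Blend soft targets from the (detached) main head with hard labels.

    mtp_logits:        list of K (batch, seq, vocab) tensors from MTPHeads
    teacher_logits:    (batch, seq, vocab) main-head logits
    next_token_labels: (batch, seq) labels aligned with the main head,
                       i.e. next_token_labels[:, t] is the token at t + 1
    """
    total = 0.0
    for i, logits in enumerate(mtp_logits):
        shift = i + 1
        student = logits[:, :-shift].flatten(0, 1)               # head i at positions t
        teacher = teacher_logits[:, shift:].detach().flatten(0, 1)  # main head at t + shift
        labels = next_token_labels[:, shift:].flatten()          # token at t + shift + 1
        kd = F.kl_div(
            F.log_softmax(student / tau, dim=-1),
            F.softmax(teacher / tau, dim=-1),
            reduction="batchmean",
        ) * tau ** 2
        ce = F.cross_entropy(student, labels)
        total = total + alpha * kd + (1.0 - alpha) * ce
    return total / len(mtp_logits)
```

Detaching the teacher logits keeps the extra training cost small and stops the distillation term from back-propagating into the main head, which is consistent with the abstract's goal of preserving main-head performance; the specific choice of a frozen teacher here is still an assumption.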

Guoliang Zhao, Ruobing Xie, An Wang, Shuaipeng Li, Huaibing Xie, Xingwu Sun • 2026

Related benchmarks

Task | Dataset | Result | Rank
Mathematical Reasoning | MATH | Speedup: 2.972 | 42
Language Modeling and Reasoning | Multi-benchmark Suite (AGIEval, GSM8K, MATH, Natural Questions, SimpleQA, TriviaQA, SuperGPQA) (cumulative) | AGIEval (EN): 90.98 | 20
General Reasoning | AGIEval (EN) | Speedup Ratio: 2.132 | 15
Knowledge Retrieval | Natural Questions | Speedup Ratio: 5.803 | 15
Knowledge Retrieval | SimpleQA | Speedup Ratio: 3.884 | 15
Knowledge Retrieval | TriviaQA | Speedup Ratio: 4.765 | 15
Mathematical Reasoning | GSM8K | Speedup Ratio: 2.414 | 15
STEM Reasoning | SuperGPQA | Speedup Ratio: 2.096 | 15
