Self-Distillation for Multi-Token Prediction
About
As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) can accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches face two challenges: limited acceptance rates of MTP heads, and difficulty in jointly training multiple MTP heads. We therefore propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost that boosts MTP-head acceptance rates (+7.5%) while maximally preserving main-head performance. We also introduce a looped extension strategy for MTP-D, enabling effective and economical extension of MTP heads and a further significant inference speedup over 1-head MTP (+220.4%). Moreover, we systematically explore and validate key insights on distillation strategies and the potential scalability of MTP through extensive experiments on seven benchmarks. These results demonstrate that MTP-D and the looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating the practical use of MTP in LLMs.
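To make the "acceptance rate" notion concrete, here is a minimal toy sketch of the draft-and-verify loop that MTP-style decoding relies on: extra heads draft future tokens, and a draft counts as accepted only if it matches what the main head would have produced. The heads below are stand-in lookup functions, not the paper's actual architecture or training procedure.

```python
# Toy illustration of multi-token prediction (MTP) with draft-and-verify
# acceptance, in the spirit of speculative decoding. All heads here are
# simple stand-in functions, not the MTP-D model itself.

def mtp_generate(main_head, mtp_heads, prefix, steps):
    """Greedy draft-and-verify loop.

    main_head(seq) -> next token predicted by the main head.
    mtp_heads: list of draft functions; each drafts one token further ahead.
    Returns (tokens, accepted, proposed) so an acceptance rate can be computed.
    """
    out = list(prefix)
    accepted = proposed = 0
    for _ in range(steps):
        # The main head always contributes one guaranteed-correct token.
        out.append(main_head(out))
        # MTP heads then draft the following tokens; each draft is kept
        # only while it agrees with the main head's own prediction.
        for head in mtp_heads:
            draft = head(out)
            proposed += 1
            if draft == main_head(out):
                out.append(draft)
                accepted += 1
            else:
                break  # first rejected draft ends this round
    return out, accepted, proposed


# Usage with hypothetical toy heads on a counting sequence:
main = lambda seq: seq[-1] + 1                             # "ground truth" head
perfect = lambda seq: seq[-1] + 1                          # always agrees
noisy = lambda seq: seq[-1] + (1 if len(seq) % 3 else 2)   # sometimes wrong

tokens, acc, prop = mtp_generate(main, [perfect, noisy], [0], steps=4)
print(f"acceptance rate = {acc / prop:.2f}")               # accepted / proposed
```

A higher acceptance rate means more drafted tokens survive verification per step, which is exactly the quantity the +7.5% improvement above refers to.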
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH | Speedup | 2.972 | 42 |
| Language Modeling and Reasoning | Multi-benchmark Suite (AGIEval, GSM8K, MATH, Natural Questions, SimpleQA, TriviaQA, SuperGPQA) (cumulative) | AGIEval (EN) | 90.98 | 20 |
| General Reasoning | AGIEval (EN) | Speedup Ratio | 2.132 | 15 |
| Knowledge Retrieval | Natural Questions | Speedup Ratio | 5.803 | 15 |
| Knowledge Retrieval | SimpleQA | Speedup Ratio | 3.884 | 15 |
| Knowledge Retrieval | TriviaQA | Speedup Ratio | 4.765 | 15 |
| Mathematical Reasoning | GSM8K | Speedup Ratio | 2.414 | 15 |
| STEM Reasoning | SuperGPQA | Speedup Ratio | 2.096 | 15 |