
Token Distillation: Attention-aware Input Embeddings For New Tokens

About

Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary. Adding new tokens can solve this problem when coupled with a good initialization for their embeddings. However, existing embedding initialization methods require expensive further training or pretraining of additional modules. In this paper, we propose Token Distillation and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens. Experimental results with a wide range of open-weight models show that Token Distillation outperforms even strong baselines.
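
The abstract's core recipe lends itself to a short sketch. Below is a minimal, hypothetical PyTorch/Transformers illustration of the distillation idea as described above: the model's representation of a string under its original (multi-subword) tokenization serves as the teacher target, and the new token's input embedding is optimized so that the model reproduces that representation when the new token is used instead. The model name, example strings, choice of final-layer hidden state as the distillation target, mean-of-subwords initialization, MSE loss, and all hyperparameters are placeholder assumptions for illustration, not the paper's actual attention-aware objective or settings.

```python
# Hypothetical sketch of embedding distillation for a new token.
# Not the authors' reference implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in open-weight causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():          # freeze the model; only the new
    p.requires_grad_(False)           # embedding vector is trained

new_token = " electroencephalography"  # hypothetical domain term
prefix = "The patient underwent"       # hypothetical context
emb = model.get_input_embeddings()     # (vocab_size, hidden_dim)

# Teacher pass: the prefix plus the new token's original subword sequence.
with torch.no_grad():
    teacher_ids = tok(prefix + new_token, return_tensors="pt").input_ids
    teacher_h = model(teacher_ids, output_hidden_states=True).hidden_states[-1]
    target = teacher_h[0, -1]  # state after consuming the full string

# Student pass: prefix embeddings followed by one trainable vector
# standing in for the new token.
prefix_ids = tok(prefix, return_tensors="pt").input_ids
prefix_embs = emb(prefix_ids).detach()

# Initialize from the mean of the subword embeddings (a common baseline).
sub_ids = tok(new_token, add_special_tokens=False).input_ids
new_emb = emb.weight[sub_ids].mean(0).detach().clone().requires_grad_(True)

opt = torch.optim.Adam([new_emb], lr=1e-3)
for step in range(200):
    inputs = torch.cat([prefix_embs, new_emb.view(1, 1, -1)], dim=1)
    h = model(inputs_embeds=inputs, output_hidden_states=True).hidden_states[-1]
    loss = torch.nn.functional.mse_loss(h[0, -1], target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After optimization, the learned vector would be written into an expanded embedding matrix alongside the new tokenizer entry. Because only a single embedding row is trained against frozen teacher representations, this kind of procedure is much cheaper than the further training or auxiliary-module pretraining the abstract contrasts against.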

Konstantin Dobler, Desmond Elliott, Gerard de Melo • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Biomedical domain adaptation | Open Medical-LLM leaderboard | Macro Average | 71 | 84 |
| Biomedical domain adaptation | Open Medical-LLM leaderboard (macro-average) | Macro Average Score | 71 | 75 |
| Definition Generation | Biomedical domain tokens | Similarity Score | 83.8 | 75 |
| Definition Generation | Multi-word tokens (famous people, places, entities, sayings, and concepts) | Correctness | 86.3 | 66 |
| Multiple-choice tasks | FrenchBench | Accuracy | 81.5 | 61 |
