Model-Aware Tokenizer Transfer

About

Large Language Models (LLMs) are trained to support an increasing number of languages, yet their predefined tokenizers remain a bottleneck for adapting models to lower-resource or distinct-script languages. Existing tokenizer transfer methods typically rely on semantic heuristics to initialize new embeddings, ignoring higher-layer model dynamics and limiting transfer quality. We propose Model-Aware Tokenizer Transfer (MATT), a method that incorporates model internals into the tokenizer transfer process. MATT introduces an Attention Influence Modeling (AIM) objective that distills inter-token communication patterns from a source model into a target model with a new tokenizer, providing an efficient warm-up before standard language modeling. Unlike approaches that focus solely on embedding similarity, MATT leverages attention behavior to guide embedding initialization and adaptation. Experiments across diverse linguistic settings show that MATT recovers a large fraction of the original model's performance within a few GPU hours, outperforming heuristic baselines. These results demonstrate that incorporating model-level signals offers a practical and effective path toward robust tokenizer transfer in multilingual LLMs.

Mykola Haltiuk, Aleksander Smywinski-Pohl • 2025
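
The abstract describes AIM as distilling inter-token communication patterns from the source model into a target model that uses a different tokenizer, but it does not say how attention is compared when the two models split text into different tokens. The sketch below is one possible reading and is not taken from the paper: it assumes each model's token-level attention is pooled to shared units (here, words) and that the pooled maps are compared with a KL divergence. The function names `pool_attention` and `attention_kl`, the pooling scheme, and the toy matrices are all hypothetical.

```python
import math

def pool_attention(attn, groups, n_units):
    """Pool token-level attention to shared units such as words (assumed scheme).

    attn[i][j] is the weight from query token i to key token j; rows sum to 1.
    groups[i] is the unit index of token i; every unit needs at least one token.
    Key weights are summed within a unit, query rows are averaged within a unit,
    so each pooled row still sums to 1.
    """
    pooled = [[0.0] * n_units for _ in range(n_units)]
    counts = [0] * n_units
    for i, row in enumerate(attn):
        counts[groups[i]] += 1
        for j, w in enumerate(row):
            pooled[groups[i]][groups[j]] += w
    return [[w / counts[u] for w in pooled[u]] for u in range(n_units)]

def attention_kl(teacher, student, eps=1e-9):
    """Mean row-wise KL(teacher || student) between two pooled attention maps."""
    total = 0.0
    for t_row, s_row in zip(teacher, student):
        total += sum(
            t * math.log((t + eps) / (s + eps))
            for t, s in zip(t_row, s_row)
            if t > 0
        )
    return total / len(teacher)

# Toy example: two words. The source tokenizer splits word 0 into two tokens,
# while the new tokenizer keeps each word as a single token.
source_attn = [
    [1.0, 0.0, 0.0],
    [0.25, 0.75, 0.0],
    [0.25, 0.25, 0.5],
]
source_groups = [0, 0, 1]

target_attn = [
    [1.0, 0.0],
    [0.25, 0.75],
]
target_groups = [0, 1]

teacher = pool_attention(source_attn, source_groups, n_units=2)
student = pool_attention(target_attn, target_groups, n_units=2)
loss = attention_kl(teacher, student)
print(teacher, student, loss)
```

In a real setup the student map would come from the model with the new tokenizer, and the loss would be minimized with respect to its parameters during the warm-up phase. Which layers and heads to match, and how tokens are aligned, are design choices the abstract leaves open.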

Related benchmarks

Task	Dataset	Result
Machine Translation	Long FLORES uk to en (test)	BLEU 27.89	14
Machine Translation	WMT Ukrainian (test)	BLEU 4.71	14
Reading Comprehension	Belebele Ukrainian (test)	Accuracy 89.56	14
Abstractive Summarization	XL-Sum Ukrainian (test)	BLEU Score 5.95	14
General Knowledge	Global MMLU Ukrainian (test)	Accuracy (%) 64.98	14
Machine Translation	Long FLORES en to uk (test)	BLEU 8.7	14
Multilingual Discriminative Language Understanding	Belebele, Global MMLU, and MMMLU Average across Arabic, German, Japanese, Swahili (mean of per-language values)	Belebele Accuracy 66.39	8
Multilingual Text Generation	Long FLORES and XL-Sum Average across Arabic, German, Japanese, Swahili	Long FLORES en->x Score 5.2	8
