Zero-Shot Tokenizer Transfer

About

Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programming languages, but have vastly decreased efficiency due to their English-centric tokenizer. To mitigate this, we should be able to swap the original LM tokenizer with an arbitrary one, on the fly, without degrading performance. Hence, in this work we define a new problem: Zero-Shot Tokenizer Transfer (ZeTT). The challenge at the core of ZeTT is finding embeddings for the tokens in the vocabulary of the new tokenizer. Since prior heuristics for initializing embeddings often perform at chance level in a ZeTT setting, we propose a new solution: we train a hypernetwork taking a tokenizer as input and predicting the corresponding embeddings. We empirically demonstrate that the hypernetwork generalizes to new tokenizers both with encoder (e.g., XLM-R) and decoder LLMs (e.g., Mistral-7B). Our method comes close to the original models' performance in cross-lingual and coding tasks while markedly reducing the length of the tokenized sequence. We also find that the remaining gap can be quickly closed by continued training on less than 1B tokens. Finally, we show that a ZeTT hypernetwork trained for a base (L)LM can also be applied to fine-tuned variants without extra training. Overall, our results make substantial strides toward detaching LMs from their tokenizer.
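To make the core problem concrete, here is a minimal sketch of the kind of heuristic embedding initializer the abstract contrasts with the hypernetwork: each token in the new vocabulary is split into pieces the old tokenizer knows, and the old embeddings of those pieces are averaged. This is not the authors' hypernetwork, and the abstract notes that heuristics like this often perform at chance level in the ZeTT setting. The toy vocabulary, the 2-dimensional vectors, and the helper names `greedy_tokenize` and `init_new_embeddings` are all assumptions made for illustration.

```python
# Toy illustration of a heuristic embedding initializer (NOT the hypernetwork).
# All vocabularies, vectors, and helper names here are hypothetical.

def greedy_tokenize(text, vocab):
    """Split text into the longest pieces found in vocab, left to right."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r} with the given vocabulary")
    return pieces


def init_new_embeddings(new_vocab, old_embeddings):
    """Embed each new token as the mean of the embeddings of its old pieces."""
    dim = len(next(iter(old_embeddings.values())))
    new_embeddings = {}
    for token in new_vocab:
        if token in old_embeddings:
            # Token exists in both vocabularies: copy its embedding.
            new_embeddings[token] = list(old_embeddings[token])
            continue
        pieces = greedy_tokenize(token, old_embeddings)
        new_embeddings[token] = [
            sum(old_embeddings[p][d] for p in pieces) / len(pieces)
            for d in range(dim)
        ]
    return new_embeddings


# Hypothetical old vocabulary with 2-dimensional embeddings.
old_embeddings = {
    "def": [1.0, 0.0],
    "in": [0.0, 1.0],
    "e": [0.5, 0.5],
}

# Hypothetical new vocabulary that adds the longer token "define".
new_vocab = ["def", "in", "define"]
new_embeddings = init_new_embeddings(new_vocab, old_embeddings)
```

In this toy setup the string "define" takes three tokens under the old vocabulary ("def", "in", "e") and one under the new vocabulary, which is the kind of sequence-length reduction the abstract refers to. The hypernetwork approach replaces the fixed averaging rule with a learned function from tokenizer to embeddings.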

Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić · 2024

Related benchmarks

Task	Dataset	Result
Code Generation	HumanEval	Pass@1 48.2	1043
Mathematical Reasoning	GSM8K (test)	Accuracy 53.2	954
Question Answering	ARC-E	Accuracy 49.03	523
Physical Interaction Question Answering	PIQA	Accuracy 74.5	415
Sentence Completion	HellaSwag	Accuracy 48.9	364
Boolean Question Answering	BoolQ	Accuracy 78.6	350
Multiple-choice Question Answering	ARC Easy	Accuracy 72.4	257
Code Generation	MBPP	Pass@1 42.4	211
Commonsense Reasoning	CommonsenseQA	Accuracy 75.3	136
Multiple-choice Question Answering	ARC Challenge	Acc 46.4	133

