Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

About

Dataset curation has become a basis for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. To address the disparity stemming from limited research on non-English languages, we develop a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and resource availability to demonstrate the effectiveness of our method. Training a 1B-parameter Llama model for 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15% of the training tokens, while also improving across other benchmarks and mitigating the curse of multilinguality. These findings provide strong evidence for the generalizability of our approach to other languages. As a result, we extend our framework to 20 languages for which we release the refined pretraining datasets.
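The abstract describes scoring web documents with a classifier and keeping the structured, knowledge-rich ones. A minimal sketch of that idea follows, under stated assumptions: the abstract does not give the selection rule, so retaining a fixed top fraction by score is an assumption here, and `toy_quality_score`, `select_top_fraction`, and `keep_fraction` are hypothetical names. The toy scorer (share of distinct words) only stands in for the trained Transformer- or FastText-based classifiers the authors mention; it is not their method.

```python
from typing import Callable, List, Tuple


def select_top_fraction(
    docs: List[str],
    score_fn: Callable[[str], float],
    keep_fraction: float,
) -> List[Tuple[float, str]]:
    """Score every document and keep the highest-scoring fraction."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Sort by score, highest first; ties keep their original order.
    scored = sorted(((score_fn(d), d) for d in docs), key=lambda p: p[0], reverse=True)
    # Always keep at least one document when the input is non-empty.
    n_keep = max(1, int(len(scored) * keep_fraction)) if scored else 0
    return scored[:n_keep]


def toy_quality_score(text: str) -> float:
    """Hypothetical stand-in for a trained classifier: share of distinct words."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


docs = [
    "buy now buy now buy now buy now",
    "The mitochondrion produces most of the cell's ATP through respiration.",
    "click here click here",
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
]

# Keep the better-scoring half of this tiny, made-up corpus.
kept = select_top_fraction(docs, toy_quality_score, keep_fraction=0.5)
for score, text in kept:
    print(f"{score:.2f}  {text}")
```

In a real pipeline, `score_fn` would wrap a classifier's predicted probability for the "high quality" label, and the retained fraction would be tuned per language rather than fixed.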

Bettina Messmer, Vinko Sabolčec, Martin Jaggi • 2025

Related benchmarks

Task	Dataset	Result
Reasoning	ARC	Accuracy 31.45	269
Natural Language Inference	XNLI	Accuracy 40.72	131
Causal Reasoning	XCOPA	Accuracy 62	55
Commonsense Reasoning	XStoryCloze	Average Score 66.25	39
Paraphrase Identification	PAWS	Accuracy 55.35	35
Reading Comprehension	Belebele c	Accuracy (Normalized) 35.44	32
Coreference Resolution	XWinograd	Accuracy 69.64	26
Multitask Language Understanding	GMMLU c	Acc (Normalized) 30.75	22
Natural Language Inference	XNLI French	Accuracy 49.04	18
Coreference Resolution	XWinograd French	Score 65.06	18

Showing 10 of 51 rows
