Enhancing Multilingual LLM Pretraining with Model-Based Data Selection
About
Dataset curation has become a foundation for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. To address the disparity stemming from limited research on non-English languages, we develop a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and resource availability to demonstrate the effectiveness of our method. When training a 1B-parameter Llama model for 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15% of the training tokens, while also improving across other benchmarks and mitigating the curse of multilinguality. These findings provide strong evidence for the generalizability of our approach to other languages. As a result, we extend our framework to 20 languages, for which we release the refined pretraining datasets.
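The general idea of model-based selection described above (score each document with a classifier, then keep the best-scoring portion) can be illustrated with a minimal sketch. This is not the authors' implementation: `select_top_fraction`, `keep_fraction`, and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for a trained FastText- or Transformer-based classifier's quality score.

```python
from typing import Callable, List, Sequence


def select_top_fraction(
    docs: Sequence[str],
    score_fn: Callable[[str], float],
    keep_fraction: float,
) -> List[str]:
    """Keep the highest-scoring fraction of documents, preserving input order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not docs:
        return []
    scores = [score_fn(d) for d in docs]
    # Always keep at least one document.
    n_keep = max(1, int(len(docs) * keep_fraction))
    # Indices of the n_keep highest scores; ties keep the earlier document.
    top = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:n_keep]
    return [docs[i] for i in sorted(top)]


def toy_score(doc: str) -> float:
    """Hypothetical stand-in for a classifier score: ratio of distinct words."""
    words = doc.split()
    return len(set(words)) / len(words) if words else 0.0


if __name__ == "__main__":
    corpus = [
        "a a a a",
        "the cat sat on the mat",
        "b b",
        "water boils at 100 degrees",
    ]
    for doc in select_top_fraction(corpus, toy_score, keep_fraction=0.5):
        print(doc)
```

In practice `score_fn` would wrap a trained classifier's predicted probability for the "high-quality" class, and the retention fraction would be chosen per language through ablations.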
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reasoning | ARC | Accuracy | 31.45 | 245 |
| Natural Language Inference | XNLI | Accuracy | 40.72 | 131 |
| Causal Reasoning | XCOPA | Accuracy | 62 | 55 |
| Commonsense Reasoning | XStoryCloze | Average Score | 66.25 | 39 |
| Reading Comprehension | Belebele c | Accuracy (Normalized) | 35.44 | 32 |
| Coreference Resolution | XWinograd | Accuracy | 69.64 | 26 |
| Paraphrase Identification | PAWS | Accuracy | 55.35 | 24 |
| Multitask Language Understanding | GMMLU c | Acc (Normalized) | 30.75 | 22 |
| Natural Language Inference | XNLI French | Accuracy | 49.04 | 18 |
| Coreference Resolution | XWinograd French | Score | 65.06 | 18 |