Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection

About

As Large Language Models (LLMs) scale, data curation has shifted from maximizing volume to optimizing the signal-to-noise ratio through quality filtering. For many languages, however, there is too little native high-quality data to train robust quality classifiers. This work investigates the hypothesis that quality markers in embedding space show cross-lingual consistency, which would allow high-resource languages to subsidize the filtering of low-resource ones. We evaluate several filtering strategies, including cross-lingual transfer, third quartile sampling (Q3), and retention rate tuning. Our results show that massive multilingual pooling frequently outperforms monolingual baselines in both rank stability and aggregate accuracy for a 1B model trained on 103B tokens. It delivers gains for high-resource languages (a 1.2% increase in aggregate normalized accuracy for French) and matches or exceeds monolingual baselines for low-resource languages. Scale alone, however, does not guarantee stability. For high-resource languages like French, we further show that refining the decision boundary, either through third quartile sampling (Q3) or by tuning the retention rate, is necessary to fully leverage the multilingual signal.
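To make the notion of a retention rate concrete, here is a minimal sketch of score-based filtering: documents are ranked by a quality classifier score and only the top fraction is kept. This is an illustration under stated assumptions, not the authors' implementation. The function name `filter_by_retention_rate`, the toy scores, and the choice to round the cutoff to the nearest document are all hypothetical, and the abstract does not specify how Q3 sampling is defined, so it is not sketched here.

```python
def filter_by_retention_rate(docs, scores, retention_rate):
    """Keep roughly the top `retention_rate` fraction of docs by quality score.

    Hypothetical helper for illustration only; `scores` stands in for the
    output of some quality classifier (higher means better).
    """
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    if not 0.0 < retention_rate <= 1.0:
        raise ValueError("retention_rate must be in (0, 1]")

    # Number of documents to keep (at least one, rounded to nearest).
    n_keep = max(1, round(retention_rate * len(docs)))

    # Rank document indices from highest to lowest score.
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

    # Restore the original corpus order among the retained documents.
    kept = sorted(ranked[:n_keep])
    return [docs[i] for i in kept]


if __name__ == "__main__":
    docs = ["a", "b", "c", "d", "e"]
    scores = [0.9, 0.1, 0.5, 0.7, 0.3]  # toy classifier scores
    print(filter_by_retention_rate(docs, scores, 0.4))
```

Tuning the retention rate then amounts to sweeping `retention_rate` and comparing downstream results for each setting.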

Yassine Turki, Vinko Sabolčec, Bettina Messmer, Martin Jaggi • 2026

Related benchmarks

Task                              Dataset           Metric                 Result  Rank
Reasoning                         ARC               Accuracy               31.71   245
Natural Language Inference        XNLI              Accuracy               41.89   131
Causal Reasoning                  XCOPA             Accuracy               61.6    55
Commonsense Reasoning             XStoryCloze       Average Score          67.5    39
Commonsense Reasoning             XCOPA             Accuracy               60.8    35
Reading Comprehension             Belebele c        Accuracy (Normalized)  37.11   32
Coreference Resolution            XWinograd         Accuracy               69.25   26
Paraphrase Identification         PAWS              Accuracy               56.1    24
Multitask Language Understanding  GMMLU c           Acc (Normalized)       30.75   22
Coreference Resolution            XWinograd French  Score                  67.47   18
Showing 10 of 58 rows
