A New Massive Multilingual Dataset for High-Performance Language Technologies

About

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann • 2024

Related benchmarks

Task                          Dataset                               Result                       Rank
News Domain Classification    COUNT 19                              Macro-F1 95.71               10
Sentiment Classification      PSL–Kabaddi                           Macro-F1 71.11               10
Offensive Language Detection  USADC                                 Macro F1 0.9351              10
Sentiment Classification      IMDB Urdu                             Macro F1 89.69               10
Linguistic Acceptability      URBLIMP                               Aspect Accuracy 98.3         10
Multilingual LLM Evaluation   Multilingual Benchmarks 18 Languages  Reading Comprehension 50.38  8
