Pre-training Polish Transformer-based Language Models at Scale
About
Transformer-based language models are now widely used in Natural Language Processing (NLP). This is especially true for English, for which many pre-trained transformer-based models have been published in recent years. These models have driven forward the state of the art for a variety of standard NLP tasks such as classification, regression, and sequence labeling, as well as text-to-text tasks such as machine translation, question answering, and summarization. The situation has been different for low-resource languages such as Polish, however. Although some transformer-based language models for Polish are available, none of them come close to the largest English-language models in corpus size or number of parameters. In this study, we present two language models for Polish based on the popular BERT architecture. The larger model was trained on a dataset consisting of over 1 billion Polish sentences, or 135 GB of raw text. We describe our methodology for collecting the data, preparing the corpus, and pre-training the model. We then evaluate our models on thirteen Polish linguistic tasks and demonstrate improvements over previous approaches on eleven of them.
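Preparing a corpus of this size typically involves normalizing and deduplicating sentences before pre-training. The paper's exact pipeline is not detailed in this summary, so the following is only a minimal sketch of the kind of hash-based sentence deduplication step such a pipeline might include; the function name and normalization rules are illustrative assumptions, not the authors' implementation.

```python
import hashlib

def dedup_sentences(sentences):
    """Drop duplicate sentences using a hash set over a normalized form.

    Illustrative only: a real corpus pipeline would stream from disk and
    may use fuzzier near-duplicate detection than exact-match hashing.
    """
    seen = set()
    out = []
    for s in sentences:
        # Normalize whitespace and case so trivial variants collapse together.
        key = hashlib.sha1(s.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(s)
    return out

corpus = [
    "Ala ma kota.",
    "ala ma kota.",   # duplicate after case normalization
    "To jest zdanie testowe.",
]
print(len(dedup_sentences(corpus)))  # prints 2
```

Hashing each normalized sentence keeps memory proportional to the number of unique sentences rather than their total text length, which matters when processing hundreds of gigabytes of raw text.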
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Understanding | KLEJ (9 tasks, val) | KLEJ Score | 88.69 | 13 |
| Financial Language Understanding | FinBench (7 tasks, val) | FinBench Score | 84.73 | 13 |
| General Language Understanding | All tasks (25 tasks, val) | Overall Accuracy | 85.7 | 13 |
| Language Understanding | Other tasks (9 tasks, val) | Other Tasks Score | 83.46 | 13 |
| Long-context Language Understanding | Long tasks (4 tasks, val) | Long Tasks Score | 82.18 | 13 |
| Binary Classification | IMDB | Accuracy | 94.36 | 9 |
| Multi-Label Classification | TwitterEMO | Weighted F1 | 70.7 | 3 |
| Single-label Classification | 8TAGS | Accuracy | 81.64 | 3 |
| Single-label Classification | PPC | Accuracy | 89.96 | 3 |
| General Polish Language Understanding | Average (25 tasks) | Average Score | 85.7 | 3 |