Pre-training Polish Transformer-based Language Models at Scale
About
Transformer-based language models are now widely used in Natural Language Processing (NLP). This is especially true for English, for which many pre-trained transformer-based models have been published in recent years. These models have driven forward the state of the art for a variety of standard NLP tasks such as classification, regression, and sequence labeling, as well as text-to-text tasks such as machine translation, question answering, and summarization. The situation has been different for low-resource languages such as Polish, however. Although some transformer-based language models for Polish are available, none of them come close to the largest English-language models in corpus size or number of parameters. In this study, we present two language models for Polish based on the popular BERT architecture. The larger model was trained on a dataset consisting of over 1 billion Polish sentences, or 135 GB of raw text. We describe our methodology for collecting the data, preparing the corpus, and pre-training the model. We then evaluate our models on thirteen Polish linguistic tasks and demonstrate improvements over previous approaches on eleven of them.
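Preparing a corpus of this size typically involves normalizing and deduplicating sentences before pre-training. The paper's exact pipeline is not detailed in this summary, so the following is only a minimal sketch of the kind of hash-based sentence deduplication step such a pipeline might include; the function name and normalization rules are illustrative assumptions, not the authors' implementation.

```python
import hashlib

def dedup_sentences(sentences):
    """Drop duplicate sentences using a hash set over a normalized form.

    Illustrative only: a real corpus pipeline would stream from disk and
    may use fuzzier near-duplicate detection than exact-match hashing.
    """
    seen = set()
    out = []
    for s in sentences:
        # Normalize whitespace and case so trivial variants collapse together.
        key = hashlib.sha1(s.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(s)
    return out

corpus = [
    "Ala ma kota.",
    "ala ma kota.",   # duplicate after case normalization
    "To jest zdanie testowe.",
]
print(len(dedup_sentences(corpus)))  # prints 2
```

Hashing each normalized sentence keeps memory proportional to the number of unique sentences rather than their total text length, which matters when processing hundreds of gigabytes of raw text.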
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Understanding | KLEJ (9 tasks, val) | KLEJ Score | 88.69 | 13 |
| Financial Language Understanding | FinBench (7 tasks, val) | FinBench Score | 84.73 | 13 |
| General Language Understanding | All tasks (25 tasks, val) | Overall Accuracy | 85.7 | 13 |
| Language Understanding | Other tasks (9 tasks, val) | Other Tasks Score | 83.46 | 13 |
| Long-context Language Understanding | Long tasks (4 tasks, val) | Long Tasks Score | 82.18 | 13 |
| Binary Classification | IMDB | Accuracy | 94.36 | 9 |
| Multi-Label Classification | TwitterEMO | Weighted F1 | 70.7 | 3 |
| Single-label Classification | 8TAGS | Accuracy | 81.64 | 3 |
| Single-label Classification | PPC | Accuracy | 89.96 | 3 |
| General Polish Language Understanding | Average (25 tasks) | Average Score | 85.7 | 3 |