
Czert – Czech BERT-like Model for Language Representation

About

This paper describes the training process of the first Czech monolingual language representation models, based on the BERT and ALBERT architectures. We pre-train our models on more than 340K sentences, 50 times more than the Czech data available to the multilingual models. We outperform the multilingual models on 9 out of 11 datasets. In addition, we establish new state-of-the-art results on nine datasets. Finally, we discuss the properties of monolingual and multilingual models based on our results. We publish all pre-trained and fine-tuned models freely for the research community.

Jakub Sido, Ondřej Pražák, Pavel Přibáň, Jan Pašek, Michal Seják, Miloslav Konopík • 2021
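
The abstract notes that all pre-trained and fine-tuned models are released for the research community. A minimal sketch of loading such a checkpoint with the Hugging Face transformers library; the hub identifier UWB-AIR/Czert-B-base-cased is an assumption here, so check the authors' release page for the exact model names:

```python
# Minimal sketch: load a released Czert checkpoint with Hugging Face transformers.
# Assumption: the hub id "UWB-AIR/Czert-B-base-cased"; consult the authors'
# release page for the actual checkpoint names.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UWB-AIR/Czert-B-base-cased")
model = AutoModel.from_pretrained("UWB-AIR/Czert-B-base-cased")

# Encode a Czech sentence and inspect the contextual token embeddings.
inputs = tokenizer("Dobrý den, jak se máte?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)
```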

Related benchmarks

Task | Dataset | Metric | Result | Rank
Named Entity Recognition | CNEC 1.1 | F1 Score | 86.27 | 20
Morphological Tagging | PDT 3.5 (test) | POS Accuracy | 98.43 | 17
Lemmatization | PDT 3.5 (test) | Lemmas Accuracy | 98.98 | 16
Named Entity Recognition | CNEC 2.0 | F1 Score | 85.33 | 16
Joint Morphological Tagging and Lemmatization | PDT 3.5 (test) | Both Correct | 98.02 | 15
Morphosyntactic Analysis | UD 2.3 | LAS | 93.13 | 15
Semantic Parsing | Prague Tectogrammatical Graphs | Properties F1 | 92.69 | 11
Sentiment Analysis | Czech Facebook dataset | Macro F1 (10-fold) | 78.52 | 8
Morphosyntactic Analysis | PDT 3.5 | POS Accuracy | 98.43 | 7
Dependency Parsing | PDT 3.5 | UAS | 93.57 | 7

Showing 10 of 12 rows.
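
The benchmark numbers above come from fine-tuning the pre-trained model on each downstream task. Below is a hedged sketch of one such setup, sentence-level sentiment classification in the style of the Czech Facebook dataset; the hub identifier and the three-class label scheme (positive/negative/neutral) are assumptions, and a real run would iterate over the full labeled corpus:

```python
# Sketch of fine-tuning a Czert checkpoint for sentence classification
# (e.g., sentiment analysis). Assumptions: the hub id
# "UWB-AIR/Czert-B-base-cased" and a three-class sentiment label scheme.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UWB-AIR/Czert-B-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "UWB-AIR/Czert-B-base-cased", num_labels=3  # classification head is newly initialized
)

# A single toy optimization step on one example; a real run iterates over
# batches of the labeled corpus with a proper learning-rate schedule.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = tokenizer(["To je skvělé!"], return_tensors="pt", padding=True)
labels = torch.tensor([0])  # hypothetical label id for "positive"
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```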
