RobeCzech: Czech RoBERTa, a monolingual contextualized language representation model
About
We present RobeCzech, a monolingual RoBERTa language representation model trained on Czech data. RoBERTa is a robustly optimized Transformer-based pretraining approach. We show that RobeCzech considerably outperforms equally-sized multilingual and Czech-trained contextualized language representation models, surpasses the previous state of the art in all five evaluated NLP tasks, and holds state-of-the-art results in four of them. The RobeCzech model is released publicly at https://hdl.handle.net/11234/1-3691 and https://huggingface.co/ufal/robeczech-base.
Milan Straka, Jakub Náplava, Jana Straková, David Samuel • 2021
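Since the model is published on the Hugging Face Hub under `ufal/robeczech-base`, it can be loaded with the standard `transformers` API. A minimal sketch, assuming `transformers` and a PyTorch backend are installed; the Czech example sentence is ours, not from the paper:

```python
# Sketch: loading RobeCzech from the Hugging Face Hub and running a
# fill-mask query (the model was pretrained with the masked-LM objective).
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")

# Predict the masked word in a Czech sentence
# ("The capital of the Czech Republic is [MASK].").
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in fill(f"Hlavní město České republiky je {tokenizer.mask_token}."):
    print(pred["token_str"], round(pred["score"], 3))
```

For the downstream tasks in the table below (tagging, parsing, NER), the same checkpoint would instead be loaded with a task-specific head (e.g. `AutoModelForTokenClassification`) and fine-tuned.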
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Named Entity Recognition | CNEC 1.1 | F1 Score | 87.82 | 20 |
| Morphological Tagging | PDT 3.5 (test) | POS Accuracy | 98.5 | 17 |
| Lemmatization | PDT 3.5 (test) | Lemmas Accuracy | 99 | 16 |
| Named Entity Recognition | CNEC 2.0 | F1 Score | 87.49 | 16 |
| Joint Morphological Tagging and Lemmatization | PDT 3.5 (test) | Both Correct | 98.11 | 15 |
| Morphosyntactic Analysis | UD 2.3 | LAS | 93.77 | 15 |
| Semantic Parsing | Prague Tectogrammatical Graphs | Properties F1 | 93.58 | 11 |
| Sentiment Analysis | Czech Facebook dataset | Macro F1 (10-fold) | 80.13 | 8 |
| Dependency Parsing | PDT 3.5 | UAS | 94.14 | 7 |
| Morphosyntactic Analysis | PDT 3.5 | POS Accuracy | 98.5 | 7 |
*Showing 10 of 12 benchmark rows.*