RobeCzech: Czech RoBERTa, a monolingual contextualized language representation model
About
We present RobeCzech, a monolingual RoBERTa language representation model trained on Czech data. RoBERTa is a robustly optimized Transformer-based pretraining approach. We show that RobeCzech considerably outperforms equally-sized multilingual and Czech-trained contextualized language representation models, surpasses the previous state of the art in all five evaluated NLP tasks, and holds state-of-the-art results in four of them. The RobeCzech model is released publicly at https://hdl.handle.net/11234/1-3691 and https://huggingface.co/ufal/robeczech-base.
Milan Straka, Jakub Náplava, Jana Straková, David Samuel • 2021
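Since the model is published on the Hugging Face Hub under `ufal/robeczech-base`, it can be loaded with the standard `transformers` API. A minimal sketch, assuming `transformers` and a PyTorch backend are installed; the Czech example sentence is ours, not from the paper:

```python
# Sketch: loading RobeCzech from the Hugging Face Hub and running a
# fill-mask query (the model was pretrained with the masked-LM objective).
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")

# Predict the masked word in a Czech sentence
# ("The capital of the Czech Republic is [MASK].").
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in fill(f"Hlavní město České republiky je {tokenizer.mask_token}."):
    print(pred["token_str"], round(pred["score"], 3))
```

For the downstream tasks in the table below (tagging, parsing, NER), the same checkpoint would instead be loaded with a task-specific head (e.g. `AutoModelForTokenClassification`) and fine-tuned.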
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Named Entity Recognition | CNEC 1.1 | F1 Score | 87.82 | 20 |
| Morphological Tagging | PDT 3.5 (test) | POS Accuracy | 98.5 | 17 |
| Lemmatization | PDT 3.5 (test) | Lemmas Accuracy | 99 | 16 |
| Named Entity Recognition | CNEC 2.0 | F1 Score | 87.49 | 16 |
| Joint Morphological Tagging and Lemmatization | PDT 3.5 (test) | Both Correct | 98.11 | 15 |
| Morphosyntactic Analysis | UD 2.3 | LAS | 93.77 | 15 |
| Semantic Parsing | Prague Tectogrammatical Graphs | Properties F1 | 93.58 | 11 |
| Sentiment Analysis | Czech Facebook dataset | Macro F1 (10-fold) | 80.13 | 8 |
| Dependency Parsing | PDT 3.5 | UAS | 94.14 | 7 |
| Morphosyntactic Analysis | PDT 3.5 | POS Accuracy | 98.5 | 7 |
*Showing 10 of 12 benchmark rows.*