
GottBERT: a pure German Language Model

About

Pre-trained language models have significantly advanced natural language processing (NLP), especially with the introduction of BERT and its optimized version, RoBERTa. While initial research focused on English, single-language models can be advantageous compared to multilingual ones in terms of pre-training effort, overall resource efficiency or downstream task performance. Despite the growing popularity of prompt-based LLMs, more compute-efficient BERT-like models remain highly relevant. In this work, we present the first German single-language RoBERTa model, GottBERT, pre-trained exclusively on the German portion of the OSCAR dataset. Additionally, we investigated the impact of filtering the OSCAR corpus. GottBERT was pre-trained using fairseq and standard hyperparameters. We evaluated its performance on two Named Entity Recognition (NER) tasks (CoNLL 2003 and GermEval 2014) and three text classification tasks (GermEval 2018 fine and coarse, and 10kGNAD) against existing German BERT models and two multilingual models. Performance was measured using the $F_{1}$ score and accuracy. The GottBERT base and large models showed competitive performance, with GottBERT leading among the base models in 4 of 6 tasks. Contrary to our expectation, the applied filtering did not significantly affect the results. To support the German NLP research community, we are releasing the GottBERT models under the MIT license.

Raphael Scheible, Johann Frei, Fabian Thomczyk, Henry He, Patric Tippmann, Jochen Knaus, Victor Jaravine, Frank Kramer, Martin Boeker • 2020
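
The abstract reports results as $F_{1}$ score and accuracy. As a reminder of what these two metrics measure, here is a minimal sketch in plain Python for the binary case. The function name `f1_and_accuracy` and the toy labels are illustrative assumptions, not part of the GottBERT evaluation code, and the averaging scheme used in the paper (for example entity-level or macro-averaged $F_{1}$) is not stated on this page.

```python
def f1_and_accuracy(y_true, y_pred, positive=1):
    """Return (precision, recall, f1, accuracy) for binary labels.

    Illustrative helper only; not taken from the GottBERT code base.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Accuracy is the fraction of labels predicted correctly.
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs) if pairs else 0.0
    return precision, recall, f1, accuracy


# Toy labels, made up for illustration.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision, recall, f1, accuracy = f1_and_accuracy(y_true, y_pred)
```

Unlike accuracy, $F_{1}$ ignores true negatives, which is one reason it is commonly preferred for tasks such as NER where most tokens belong to no entity.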

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Named Entity Recognition | CoNLL 2003 (test) | F1 Score 87.19 | 556 |
| Natural Language Inference | XNLI (test) | -- | 167 |
| Named Entity Recognition | German GermEval14 (test) | F1 Score 88.83 | 21 |
| Multilabel Political Party Classification | Bundestag and Wahl-O-Mat combined dataset 2024/2025 (test) | F1 Score 74 | 13 |
| Offensive Language Identification | GermEval coarse 2018 (test) | F1 Score 81.28 | 4 |
| Topic Classification | 10kGNAD (test) | F1 Score 90.97 | 2 |
| Offensive Language Identification | GermEval fine 2018 (test) | F1 Score 57.37 | 2 |
