
SciBERT: A Pretrained Language Model for Scientific Text

About

Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.
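As a hedged sketch of how the pretrained model can be used downstream: the released checkpoints are also distributed on the Hugging Face Hub under the `allenai/scibert_scivocab_uncased` identifier, so one common way to obtain sentence representations is via the `transformers` library (an assumption on tooling; the linked repository also ships its own checkpoints and training code):

```python
# Minimal sketch: extracting a [CLS] sentence embedding from SciBERT
# using the Hugging Face `transformers` library. The model identifier
# `allenai/scibert_scivocab_uncased` refers to the uncased SciVocab
# variant of the released checkpoints.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "allenai/scibert_scivocab_uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

text = "The glucose transporter GLUT1 is overexpressed in tumor cells."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# First token ([CLS]) of the last hidden layer, often used as a
# sentence-level representation for downstream classification.
cls_embedding = outputs.last_hidden_state[:, 0, :]
```

For the downstream tasks evaluated in the paper (sequence tagging, sentence classification, dependency parsing), this encoder would typically be fine-tuned end to end with a task-specific head rather than used as a frozen feature extractor.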

Iz Beltagy, Kyle Lo, Arman Cohan • 2019

Related benchmarks

Task                        | Dataset             | Metric                | Result | Rank
----------------------------|---------------------|-----------------------|--------|-----
Node Classification         | Cora (test)         | Mean Accuracy         | 45     | 687
Link Prediction             | Citeseer            | --                    | --     | 146
Medical Question Answering  | MedMCQA (test)      | Accuracy              | 39.2   | 134
Molecule Captioning         | ChEBI-20 (test)     | BLEU-4                | 0.113  | 107
Relation Extraction         | CDR (test)          | F1 Score              | 65.1   | 92
Named Entity Recognition    | BC5CDR (test)       | Macro F1 (span-level) | 90.01  | 80
Question Answering          | MedQA               | Accuracy              | 39.2   | 70
Relation Extraction         | GDA (test)          | F1 Score              | 82.5   | 65
Named Entity Recognition    | NCBI-disease (test) | --                    | --     | 40
Text-Molecule Retrieval     | ChEBI-20 (test)     | Hits@1                | 16.8   | 31

Showing 10 of 131 rows
