SciBERT: A Pretrained Language Model for Scientific Text
About
Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018), to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification, and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.
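As a quick illustration of using the released pretrained model, the sketch below loads SciBERT through the Hugging Face `transformers` library and extracts contextual token embeddings for a scientific sentence. The checkpoint name `allenai/scibert_scivocab_uncased` is the uncased SciVocab model published by AllenAI; the example sentence is arbitrary.

```python
# Minimal sketch: contextual embeddings from SciBERT via Hugging Face transformers.
# Assumes the `allenai/scibert_scivocab_uncased` checkpoint (uncased, SciVocab).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
model.eval()

sentence = "The glomerular filtration rate was reduced in treated mice."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual vector per wordpiece token
# (SciBERT uses the BERT-Base architecture).
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```

These token-level embeddings (or a fine-tuned head on top of them) are the starting point for the downstream tasks evaluated above, such as sequence tagging and sentence classification.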
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Node Classification | Cora (test) | Mean Accuracy | 45 | 687 |
| Link Prediction | Citeseer | -- | -- | 146 |
| Medical Question Answering | MedMCQA (test) | Accuracy | 39.2 | 134 |
| Molecule Captioning | ChEBI-20 (test) | BLEU-4 | 0.113 | 107 |
| Relation Extraction | CDR (test) | F1 Score | 65.1 | 92 |
| Named Entity Recognition | BC5CDR (test) | Macro F1 (span-level) | 90.01 | 80 |
| Question Answering | MedQA | Accuracy | 39.2 | 70 |
| Relation Extraction | GDA (test) | F1 Score | 82.5 | 65 |
| Named Entity Recognition | NCBI-disease (test) | -- | -- | 40 |
| Text-Molecule Retrieval | ChEBI-20 (test) | Hits@1 | 16.8 | 31 |