SciBERT: A Pretrained Language Model for Scientific Text
About
Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018), to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification, and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.
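As a quick illustration of using the released pretrained model, the sketch below loads SciBERT through the Hugging Face `transformers` library and extracts contextual token embeddings for a scientific sentence. The checkpoint name `allenai/scibert_scivocab_uncased` is the uncased SciVocab model published by AllenAI; the example sentence is arbitrary.

```python
# Minimal sketch: contextual embeddings from SciBERT via Hugging Face transformers.
# Assumes the `allenai/scibert_scivocab_uncased` checkpoint (uncased, SciVocab).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
model.eval()

sentence = "The glomerular filtration rate was reduced in treated mice."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual vector per wordpiece token
# (SciBERT uses the BERT-Base architecture).
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```

These token-level embeddings (or a fine-tuned head on top of them) are the starting point for the downstream tasks evaluated above, such as sequence tagging and sentence classification.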
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Node Classification | Cora (test) | Mean Accuracy | 45 | 687 |
| Link Prediction | Citeseer | -- | -- | 146 |
| Medical Question Answering | MedMCQA (test) | Accuracy | 39.2 | 134 |
| Molecule Captioning | ChEBI-20 (test) | BLEU-4 | 0.113 | 107 |
| Relation Extraction | CDR (test) | F1 Score | 65.1 | 92 |
| Named Entity Recognition | BC5CDR (test) | Macro F1 (span-level) | 90.01 | 80 |
| Question Answering | MedQA | Accuracy | 39.2 | 70 |
| Relation Extraction | GDA (test) | F1 Score | 82.5 | 65 |
| Named Entity Recognition | NCBI-disease (test) | -- | -- | 40 |
| Text-Molecule Retrieval | ChEBI-20 (test) | Hits@1 | 16.8 | 31 |