BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model

About

Pretrained language models serve as important backbones for natural language processing. Recently, in-domain pretraining has been shown to benefit various domain-specific downstream tasks. In the biomedical domain, natural language generation (NLG) tasks are critically important yet understudied. In the general domain, approaching natural language understanding (NLU) tasks as NLG, through constrained language generation or language prompting, achieves satisfactory performance. We highlight the lack of in-domain generative language models and of systematic generative downstream benchmarks in the biomedical domain, which hinders progress in the research community. In this work, we introduce BioBART, a generative language model that adapts BART to the biomedical domain. We collate various biomedical language generation tasks, including dialogue, summarization, entity linking, and named entity recognition. BioBART, pretrained on PubMed abstracts, improves on BART's performance and sets strong baselines on several tasks. Furthermore, we conduct ablation studies on the pretraining tasks for BioBART and find that sentence permutation has negative effects on downstream tasks.
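
To make the ablated pretraining task concrete, here is a minimal sketch of sentence permutation as it is described for BART: a document is split into sentences at full stops and the sentences are shuffled, and the model is trained to reconstruct the original order. This is not the authors' code. The function name `permute_sentences`, the naive full-stop splitting, and the `seed` parameter are assumptions made for illustration.

```python
import random


def permute_sentences(text, seed=None):
    """Shuffle the sentences of `text` (hypothetical helper, illustration only).

    Sentences are split naively on full stops, which mishandles
    abbreviations and decimals; a real pipeline would use a proper
    sentence splitter.
    """
    rng = random.Random(seed)
    # Drop empty fragments produced by the trailing full stop.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    # Re-attach the full stops and join the shuffled sentences.
    return " ".join(s + "." for s in sentences)


original = (
    "BioBART adapts BART to the biomedical domain. "
    "It is pretrained on PubMed abstracts. "
    "It is evaluated on several generation tasks."
)
corrupted = permute_sentences(original, seed=0)
# A denoising model would take `corrupted` as input and `original` as target.
print(corrupted)
```

The shuffled text keeps every sentence intact and changes only their order, so the reconstruction target is always the unmodified document.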

Hongyi Yuan, Zheng Yuan, Ruyi Gan, Jiaxing Zhang, Yutao Xie, Sheng Yu • 2022

Related benchmarks

Task	Dataset	Metric	Result	
Named Entity Recognition	GENIA	F1 Score	79.93	58
Biomedical Entity Linking	COMETA	Acc@1	81.8	20
Biomedical Entity Linking	NCBI	Acc@1	89.9	20
Biomedical Entity Linking	AAP	Acc@1	89.4	15
Biomedical Entity Linking	BC5CDR	Acc@1	93.5	15
Biomedical Entity Linking	MM-ST21pv	Acc@1	71.8	13
Named Entity Recognition	CADEC	F1 Score	70.53	9
Dialogue System	Covid19-Dialogue (test)	BLEU	12.05	5
Entity Linking	BC5CDR (test)	Recall@1	0.9326	5
Entity Linking	COMETA (test)	Recall@1	81.77	5

Showing 10 of 21 rows
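
For readers unfamiliar with the entity linking metrics listed here, the following is a minimal sketch of how a top-k accuracy such as Acc@1 is commonly computed: the fraction of mentions whose gold concept appears among the top k ranked candidates. The function name `accuracy_at_k` and the concept identifiers are hypothetical, and the exact scoring rules of each benchmark may differ.

```python
def accuracy_at_k(ranked_predictions, gold_ids, k=1):
    """Fraction of mentions whose gold concept is in the top-k candidates.

    ranked_predictions: one list of candidate concept IDs per mention,
        ordered from most to least likely.
    gold_ids: the gold concept ID for each mention.
    """
    if len(ranked_predictions) != len(gold_ids):
        raise ValueError("need exactly one candidate list per gold label")
    if not gold_ids:
        return 0.0
    hits = sum(
        gold in candidates[:k]
        for candidates, gold in zip(ranked_predictions, gold_ids)
    )
    return hits / len(gold_ids)


# Hypothetical concept IDs, for illustration only.
predictions = [["C1", "C2"], ["C3", "C4"]]
gold = ["C1", "C4"]
print(accuracy_at_k(predictions, gold, k=1))  # 0.5
print(accuracy_at_k(predictions, gold, k=2))  # 1.0
```

The function returns a fraction between 0 and 1; multiplying by 100 gives a percentage. The listing appears to contain both forms (0.9326 alongside values such as 81.77).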
