Biomedical Named Entity Recognition at Scale
About
Named entity recognition (NER) is a widely applicable natural language processing task and a building block of question answering, topic modeling, information retrieval, and similar applications. In the medical domain, NER plays a crucial role by extracting meaningful chunks from clinical notes and reports, which are then fed to downstream tasks like assertion status detection, entity resolution, relation extraction, and de-identification. Reimplementing a Bi-LSTM-CNN-Char deep learning architecture on top of Apache Spark, we present a single trainable NER model that obtains new state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings like BERT. This includes improving BC4CHEMD to 93.72% (4.1% gain), Species800 to 80.91% (4.6% gain), and JNLPBA to 81.29% (5.2% gain). In addition, this model is freely available within a production-grade code base as part of the open-source Spark NLP library; can scale up for training and inference in any Spark cluster; has GPU support and libraries for popular programming languages such as Python, R, Scala and Java; and can be extended to support other human languages with no code changes.
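To make the idea of "extracting meaningful chunks" concrete, the sketch below groups token-level BIO tags into entity chunks, which is the form that downstream tasks typically consume. It is a minimal illustration in plain Python, not the Spark NLP implementation: the function name `bio_to_chunks`, the example sentence, and the tag sequence are all assumptions made for this example (the `Disease` and `Chemical` labels follow the entity types of BC5CDR).

```python
from typing import List, Optional, Tuple


def bio_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (chunk_text, entity_type) pairs.

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the chunk that is currently open, if any.
        nonlocal current_tokens, current_type
        if current_tokens and current_type is not None:
            chunks.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A B- tag, or an I- tag that does not continue the open chunk,
            # starts a new chunk.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the open chunk.
            current_tokens.append(token)
        else:
            # "O" (outside any entity) closes the open chunk.
            flush()
    flush()
    return chunks


# Invented example sentence and tags (assumption, not taken from any benchmark).
tokens = ["Patient", "denies", "chest", "pain", "but", "takes", "aspirin", "daily", "."]
tags = ["O", "O", "B-Disease", "I-Disease", "O", "O", "B-Chemical", "O", "O"]

print(bio_to_chunks(tokens, tags))
```

Each returned pair holds the chunk text and its entity type, so a later step such as entity resolution or assertion status detection can operate on whole mentions rather than on individual tokens.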
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Named Entity Recognition | BC5CDR (test) | Macro F1 (span-level) | 89.73 | 80 |
| Named Entity Recognition | BC5CDR | F1 Score | 89.73 | 59 |
| Named Entity Recognition | NCBI-disease (test) | -- | -- | 40 |
| Named Entity Recognition | NCBI-disease | F1 Score | 89.13 | 29 |
| Named Entity Recognition | JNLPBA (test) | Macro F1 (span-level) | 81.29 | 23 |
| Named Entity Recognition | AnatEM | F1 Score | 89.13 | 21 |
| Named Entity Recognition | BC4CHEMD | F1 Score | 93.72 | 14 |
| Named Entity Recognition | NCBI-Disease preprocessed (test) | Micro F1 (excl. O) | 89.13 | 4 |
| Named Entity Recognition | BC5CDR preprocessed (test) | Micro F1 (excl. O) | 89.73 | 4 |
| Named Entity Recognition | BC4CHEMD preprocessed (test) | Micro F1 (excl. O) | 93.72 | 4 |
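The table mixes micro- and macro-averaged F1 scores, which can differ considerably on the same predictions. The sketch below shows one common way to compute both at the span level; it is an illustration under stated assumptions, not the scoring script used for these leaderboards. The `Span` representation, the function names `f1` and `span_f1`, and the gold and predicted spans are all invented for the example. Because only entity spans are scored, tokens labeled `O` never contribute to either average.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# (start_token, end_token_exclusive, entity_type); an assumed representation.
Span = Tuple[int, int, str]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def span_f1(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Exact-match span-level micro and macro F1 (hypothetical helper)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for span in pred:
        # A predicted span counts as correct only if boundaries and type match.
        if span in gold:
            tp[span[2]] += 1
        else:
            fp[span[2]] += 1
    for span in gold - pred:
        fn[span[2]] += 1

    types = sorted({span[2] for span in gold | pred})
    # Micro: pool the counts over all entity types, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per entity type, then take the unweighted mean.
    macro = sum(f1(tp[t], fp[t], fn[t]) for t in types) / len(types) if types else 0.0
    return {"micro": micro, "macro": macro}


# Invented spans: three frequent-type mentions found, one rare-type mention missed
# and one rare-type mention predicted in the wrong place.
gold = {(2, 4, "Disease"), (6, 7, "Chemical"), (10, 11, "Chemical"), (15, 16, "Chemical")}
pred = {(6, 7, "Chemical"), (10, 11, "Chemical"), (15, 16, "Chemical"), (20, 22, "Disease")}

scores = span_f1(gold, pred)
print(scores)  # micro = 0.75, macro = 0.5
```

In this example the micro average (0.75) is dominated by the frequent `Chemical` type, while the macro average (0.5) weights the completely missed `Disease` type equally, which is why scores reported under different averaging schemes are not directly comparable.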