An evaluation of GPT models for phenotype concept recognition

About

Objective: Clinical deep phenotyping and phenotype annotation play a critical role both in diagnosing patients with rare disorders and in building computationally tractable knowledge in the rare disorders field. These processes rely on ontology concepts, often from the Human Phenotype Ontology, in conjunction with a phenotype concept recognition task (usually supported by machine learning methods) to curate patient profiles or existing scientific literature. With the significant shift towards large language models (LLMs) for most NLP tasks, we examine the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT as a foundation for the tasks of clinical phenotyping and phenotype annotation.

Materials and Methods: The experimental setup of the study included seven prompts of various levels of specificity, two GPT models (gpt-3.5-turbo and gpt-4.0) and two established gold standard corpora for phenotype recognition, one consisting of publication abstracts and the other of clinical observations.

Results: Our results show that, with an appropriate setup, these models can achieve state-of-the-art performance. The best run, using few-shot learning, achieved a macro F1 score of 0.58 on publication abstracts and 0.75 on clinical observations; the former is comparable with the state of the art, while the latter surpasses the current best-in-class tool.

Conclusion: While the results are promising, the non-deterministic nature of the outcomes, the high cost and the lack of concordance between different runs using the same prompt and input make the use of these LLMs challenging for this particular task.
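For readers unfamiliar with the metric reported in the results, the following is a minimal sketch of how a macro F1 score could be computed for document-level concept recognition. It assumes that "macro" means averaging per-document F1 scores and that predictions and gold annotations are compared as sets of ontology IDs; the abstract does not specify either detail, so the study's own evaluation may differ. The function names and the example IDs are hypothetical and purely illustrative.

```python
from typing import Dict, Set, Tuple


def precision_recall_f1(gold: Set[str], predicted: Set[str]) -> Tuple[float, float, float]:
    """Score one document by comparing predicted concept IDs with gold IDs."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def macro_f1(gold_by_doc: Dict[str, Set[str]], pred_by_doc: Dict[str, Set[str]]) -> float:
    """Average the per-document F1 scores (assumed meaning of 'macro' here)."""
    scores = [
        precision_recall_f1(gold, pred_by_doc.get(doc_id, set()))[2]
        for doc_id, gold in gold_by_doc.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative data only: two documents with made-up annotations.
gold_annotations = {
    "doc1": {"HP:0001250", "HP:0001263"},
    "doc2": {"HP:0000252"},
}
model_predictions = {
    "doc1": {"HP:0001250"},                # one of two gold concepts found
    "doc2": {"HP:0000252", "HP:0004322"},  # one correct, one spurious
}

print(round(macro_f1(gold_annotations, model_predictions), 4))
```

Averaging per document gives every document equal weight regardless of how many concepts it contains, which is one common reading of "macro"; averaging per concept class is another.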

Tudor Groza, Harry Caufield, Dylan Gration, Gareth Baynam, Melissa A Haendel, Peter N Robinson, Christopher J Mungall, Justin T Reese • 2023

Related benchmarks

Task	Dataset	Result
Document-level phenotype concept recognition	BIOC-GS	Precision (P) 4.42	12
Document-level phenotype concept recognition	GSC 2024	Precision 18.25	12
Document-level phenotype concept recognition	ID-68	Precision 20.04	12
Document-level phenotype concept recognition	Average BIOC-GS GSC-2024 ID-68	Average F1 Score 10.53	12
Phenotype Concept Recognition	BIOC-GS (test)	Precision (P) 2.91	12
Phenotype Concept Recognition	GSC 2024 (test)	Precision (P) 15.31	12
Phenotype Concept Recognition	ID-68 (test)	Precision 18.29	12
