Our new X account is live! Follow @wizwand_team for updates
WorkDL logo mark

Me LLaMA: Foundation Large Language Models for Medical Applications

About

Recent advancements in large language models (LLMs) like ChatGPT and LLaMA show promise in medical applications, yet challenges remain in medical language comprehension. This study presents Me-LLaMA, a new medical LLM family based on open-source LLaMA models, optimized for medical text analysis and diagnosis by leveraging large-scale, domain-specific datasets. The Me-LLaMA family, including foundation models Me-LLaMA 13/70B and their chat-enhanced versions, was developed through continued pre-training and instruction tuning with 129B tokens and 214K samples from biomedical and clinical sources. Training the 70B models required over 100,000 A100 GPU hours. Me-LLaMA's performance was evaluated across six medical text analysis tasks using 12 benchmark datasets and complex clinical case diagnosis, with automatic and human evaluations. Results indicate Me-LLaMA outperforms LLaMA and other open-source medical LLMs in zero-shot and supervised settings. Task-specific tuning further boosts performance, surpassing ChatGPT on 7 of 8 datasets and GPT-4 on 5 of 8. For complex clinical cases, Me-LLaMA achieves performance comparable to ChatGPT and GPT-4. This work underscores the importance of domain-specific data in developing medical LLMs and addresses the high computational costs involved in training, highlighting a balance between pre-training and fine-tuning strategies. Me-LLaMA models are now accessible under user agreements, providing a valuable resource for advancing medical AI.

Qianqian Xie, Qingyu Chen, Aokun Chen, Cheng Peng, Yan Hu, Fongci Lin, Xueqing Peng, Jimin Huang, Jeffrey Zhang, Vipina Keloth, Xinyu Zhou, Lingfei Qian, Huan He, Dennis Shung, Lucila Ohno-Machado, Yonghui Wu, Hua Xu, Jiang Bian• 2024

Related benchmarks

TaskDatasetResultRank
Clinical Question AnsweringMedMCQA
Accuracy71.1
14
Clinical Question AnsweringGPQA Bio
Accuracy51.2
14
Clinical Question AnsweringNEJM-MedQA
Accuracy44.2
14
Clinical Question AnsweringMedQA
Accuracy58
14
Medical Question AnsweringMedQA (M-QA)
Base Accuracy Std Dev2.28
13
Medical Question AnsweringNEJM-MedQA
Base Deviation3.96
13
Fairness evaluationEquityMedQA cross-population (test)
CDR (Race)5.4
8
ClassificationCLIP 200 random samples (test)
Macro F1 Score0.192
6
ClassificationMTS-Specialty 200 random samples (test)
Macro F1 Score8
6
Deployment Cost AnalysisGeneral Queries
Peak Memory (GB)141.2
6
Showing 10 of 14 rows

Other info

Follow for update