Me LLaMA: Foundation Large Language Models for Medical Applications
About
Recent advancements in large language models (LLMs) such as ChatGPT and LLaMA show promise in medical applications, yet challenges remain in medical language comprehension. This study presents Me-LLaMA, a new family of medical LLMs based on the open-source LLaMA models and optimized for medical text analysis and diagnosis by leveraging large-scale, domain-specific datasets.

The Me-LLaMA family, comprising the foundation models Me-LLaMA 13B/70B and their chat-enhanced versions, was developed through continued pre-training and instruction tuning on 129B tokens and 214K instruction samples from biomedical and clinical sources. Training the 70B models required over 100,000 A100 GPU hours.

Me-LLaMA was evaluated on six medical text analysis tasks across 12 benchmark datasets, as well as on complex clinical case diagnosis, using both automatic and human evaluation. Results indicate that Me-LLaMA outperforms LLaMA and other open-source medical LLMs in both zero-shot and supervised settings. Task-specific tuning further boosts performance, surpassing ChatGPT on 7 of 8 datasets and GPT-4 on 5 of 8. On complex clinical cases, Me-LLaMA achieves performance comparable to ChatGPT and GPT-4.

This work underscores the importance of domain-specific data in developing medical LLMs and addresses the high computational costs of training, highlighting the balance between pre-training and fine-tuning strategies. The Me-LLaMA models are now accessible under user agreements, providing a valuable resource for advancing medical AI.
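The zero-shot results above are reported as accuracy on multiple-choice benchmarks such as MedQA and MedMCQA. As a minimal sketch of how such an evaluation loop typically works (`ask_model` is a hypothetical placeholder for a call to the model, not part of the released code, and the two toy items below are illustrative, not benchmark data):

```python
def build_prompt(question, options):
    """Format one multiple-choice item as a zero-shot prompt."""
    lines = [question] + [f"{key}. {text}" for key, text in sorted(options.items())]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def accuracy(items, ask_model):
    """Fraction of items where the model's answer letter matches the gold key."""
    correct = sum(
        ask_model(build_prompt(q, opts)).strip().upper().startswith(gold)
        for q, opts, gold in items
    )
    return correct / len(items)

# Toy items with a stub "model" that always answers "A".
items = [
    ("Which vitamin deficiency causes scurvy?",
     {"A": "Vitamin C", "B": "Vitamin D"}, "A"),
    ("Which organ produces insulin?",
     {"A": "Liver", "B": "Pancreas"}, "B"),
]
print(accuracy(items, lambda prompt: "A"))  # 0.5 with this stub
```

In practice the answer is extracted from the model's generated text (or scored via option log-likelihoods), but the accuracy computation itself is this simple.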
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Clinical Question Answering | MedMCQA | Accuracy | 71.1 | 14 |
| Clinical Question Answering | GPQA Bio | Accuracy | 51.2 | 14 |
| Clinical Question Answering | NEJM-MedQA | Accuracy | 44.2 | 14 |
| Clinical Question Answering | MedQA | Accuracy | 58 | 14 |
| Medical Question Answering | MedQA (M-QA) | Base Accuracy Std Dev | 2.28 | 13 |
| Medical Question Answering | NEJM-MedQA | Base Deviation | 3.96 | 13 |
| Fairness evaluation | EquityMedQA cross-population (test) | CDR (Race) | 5.4 | 8 |
| Classification | CLIP 200 random samples (test) | Macro F1 Score | 0.192 | 6 |
| Classification | MTS-Specialty 200 random samples (test) | Macro F1 Score | 8 | 6 |
| Deployment Cost Analysis | General Queries | Peak Memory (GB) | 141.2 | 6 |
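The classification rows report macro F1, which averages per-class F1 scores so that every class counts equally regardless of frequency. A minimal sketch of the metric (toy labels, not the benchmark data):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = set(y_true) | set(y_pred)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but true label was t
            fn[t] += 1  # true label t was missed
    f1s = []
    for label in labels:
        prec = tp[label] / (tp[label] + fp[label]) if (tp[label] + fp[label]) else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if (tp[label] + fn[label]) else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: rare class "c" is never predicted, dragging the macro average down.
print(round(macro_f1(["a", "a", "b", "c"], ["a", "b", "b", "b"]), 3))  # → 0.389
```

Because each class contributes equally, a model that ignores rare classes scores poorly on macro F1 even when its raw accuracy is high, which is why it is the metric of choice for imbalanced classification datasets like CLIP.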