Me LLaMA: Foundation Large Language Models for Medical Applications
About
Recent advancements in large language models (LLMs) such as ChatGPT and LLaMA show promise in medical applications, yet challenges remain in medical language comprehension. This study presents Me-LLaMA, a new family of medical LLMs based on the open-source LLaMA models and optimized for medical text analysis and diagnosis by leveraging large-scale, domain-specific datasets.

The Me-LLaMA family, comprising the foundation models Me-LLaMA 13B/70B and their chat-enhanced versions, was developed through continued pre-training and instruction tuning on 129B tokens and 214K instruction samples from biomedical and clinical sources. Training the 70B models required over 100,000 A100 GPU hours.

Me-LLaMA was evaluated on six medical text analysis tasks across 12 benchmark datasets, as well as on complex clinical case diagnosis, using both automatic and human evaluation. Results indicate that Me-LLaMA outperforms LLaMA and other open-source medical LLMs in both zero-shot and supervised settings. Task-specific tuning further boosts performance, surpassing ChatGPT on 7 of 8 datasets and GPT-4 on 5 of 8. On complex clinical cases, Me-LLaMA achieves performance comparable to ChatGPT and GPT-4.

This work underscores the importance of domain-specific data in developing medical LLMs and addresses the high computational costs of training, highlighting the balance between pre-training and fine-tuning strategies. The Me-LLaMA models are now accessible under user agreements, providing a valuable resource for advancing medical AI.
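The zero-shot results above are reported as accuracy on multiple-choice benchmarks such as MedQA and MedMCQA. As a minimal sketch of how such an evaluation loop typically works (`ask_model` is a hypothetical placeholder for a call to the model, not part of the released code, and the two toy items below are illustrative, not benchmark data):

```python
def build_prompt(question, options):
    """Format one multiple-choice item as a zero-shot prompt."""
    lines = [question] + [f"{key}. {text}" for key, text in sorted(options.items())]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def accuracy(items, ask_model):
    """Fraction of items where the model's answer letter matches the gold key."""
    correct = sum(
        ask_model(build_prompt(q, opts)).strip().upper().startswith(gold)
        for q, opts, gold in items
    )
    return correct / len(items)

# Toy items with a stub "model" that always answers "A".
items = [
    ("Which vitamin deficiency causes scurvy?",
     {"A": "Vitamin C", "B": "Vitamin D"}, "A"),
    ("Which organ produces insulin?",
     {"A": "Liver", "B": "Pancreas"}, "B"),
]
print(accuracy(items, lambda prompt: "A"))  # 0.5 with this stub
```

In practice the answer is extracted from the model's generated text (or scored via option log-likelihoods), but the accuracy computation itself is this simple.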
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Clinical Question Answering | MedMCQA | Accuracy | 71.1 | 14 |
| Clinical Question Answering | GPQA Bio | Accuracy | 51.2 | 14 |
| Clinical Question Answering | NEJM-MedQA | Accuracy | 44.2 | 14 |
| Clinical Question Answering | MedQA | Accuracy | 58 | 14 |
| Medical Question Answering | MedQA (M-QA) | Base Accuracy Std Dev | 2.28 | 13 |
| Medical Question Answering | NEJM-MedQA | Base Deviation | 3.96 | 13 |
| Fairness evaluation | EquityMedQA cross-population (test) | CDR (Race) | 5.4 | 8 |
| Classification | CLIP 200 random samples (test) | Macro F1 Score | 0.192 | 6 |
| Classification | MTS-Specialty 200 random samples (test) | Macro F1 Score | 8 | 6 |
| Deployment Cost Analysis | General Queries | Peak Memory (GB) | 141.2 | 6 |
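The classification rows report macro F1, which averages per-class F1 scores so that every class counts equally regardless of frequency. A minimal sketch of the metric (toy labels, not the benchmark data):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = set(y_true) | set(y_pred)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but true label was t
            fn[t] += 1  # true label t was missed
    f1s = []
    for label in labels:
        prec = tp[label] / (tp[label] + fp[label]) if (tp[label] + fp[label]) else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if (tp[label] + fn[label]) else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: rare class "c" is never predicted, dragging the macro average down.
print(round(macro_f1(["a", "a", "b", "c"], ["a", "b", "b", "b"]), 3))  # → 0.389
```

Because each class contributes equally, a model that ignores rare classes scores poorly on macro F1 even when its raw accuracy is high, which is why it is the metric of choice for imbalanced classification datasets like CLIP.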