Med42-v2: A Suite of Clinical LLMs
About
Med42-v2 introduces a suite of clinical large language models (LLMs) designed to address the limitations of generic models in healthcare settings. The models are built on the Llama3 architecture and fine-tuned on specialized clinical data, then put through multi-stage preference alignment so that they respond effectively to natural prompts. Generic models are often preference-aligned to avoid answering clinical queries as a precaution; Med42-v2 is specifically trained to overcome this limitation, enabling its use in clinical settings. Across various medical benchmarks, Med42-v2 models outperform both the original Llama3 models, in the 8B and 70B parameter configurations, and GPT-4. These LLMs are developed to understand clinical queries, perform reasoning tasks, and provide valuable assistance in clinical environments. The models are publicly available at https://huggingface.co/m42-health.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Medical Question Answering | MedMCQA | Accuracy | 62.28 | 346 |
| Medical Question Answering | MedQA | Accuracy | 59.78 | 153 |
| Question Answering | MedQA | Accuracy | 77.5 | 96 |
| Medical Question Answering | PubMedQA | Accuracy | 78.1 | 92 |
| Question Answering | MMLU | Accuracy | 60.5 | 46 |
| Medical Reasoning | HealthBench Hard | Accuracy | 17.21 | 41 |
| Health-related dialogue and decision-making | HealthBench Main | Average Score | 26.04 | 22 |
| Medical order extraction | SIMORD (test) | Match Count | 65.2 | 22 |
| Medical Data and Knowledge Processing | EHRStruct eICU | D-U1 Accuracy | 20 | 20 |
| Data-Driven Structured EHR Understanding and Reasoning | Synthea | D-R2 Accuracy | 17 | 19 |