Med-Flamingo: a Multimodal Medical Few-shot Learner
About
Medicine, by its nature, is a multifaceted domain that requires the synthesis of information across various modalities. Medical generative vision-language models (VLMs) make a first step in this direction and promise many exciting clinical applications. However, existing models typically have to be fine-tuned on sizeable downstream datasets, which poses a significant limitation as in many medical applications data is scarce, necessitating models that are capable of learning from few examples in real time. Here we propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain. Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks. Med-Flamingo unlocks few-shot generative medical visual question answering (VQA) abilities, which we evaluate on several datasets, including a novel, challenging open-ended VQA dataset of visual USMLE-style problems. Furthermore, we conduct the first human evaluation for generative medical VQA, in which physicians review the problems and blinded generations in an interactive app. Med-Flamingo improves performance in generative medical VQA by up to 20% in clinicians' rating and is the first to enable multimodal medical few-shot adaptations, such as rationale generation. We release our model, code, and evaluation app at https://github.com/snap-stanford/med-flamingo.
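As a rough illustration of how few-shot prompting with interleaved image-text works in this setting, the sketch below uses the OpenFlamingo-style Python API (`create_model_and_transforms`, `<image>` and `<|endofchunk|>` tokens, `model.generate` with `vision_x`/`lang_x`). It is a minimal example under assumptions: the checkpoint paths, image filenames, and prompt template are placeholders, not the released Med-Flamingo configuration; see the repository linked above for the official inference code.

```python
import torch
from PIL import Image
from open_flamingo import create_model_and_transforms

# Build an OpenFlamingo-9B-style backbone (CLIP ViT-L/14 vision encoder +
# a 7B language model); Med-Flamingo weights would be loaded on top.
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="path/to/llama-7b",   # assumption: local language-model weights
    tokenizer_path="path/to/llama-7b",
    cross_attn_every_n_layers=4,
)
# Assumption: a locally downloaded Med-Flamingo checkpoint.
model.load_state_dict(torch.load("path/to/med-flamingo.pt"), strict=False)

# Few-shot prompt: two worked examples followed by the query image.
images = [Image.open(p) for p in ["shot1.png", "shot2.png", "query.png"]]
vision_x = torch.stack([image_processor(im) for im in images], dim=0)
vision_x = vision_x.unsqueeze(0).unsqueeze(2)  # (batch, num_images, frames, C, H, W)

# <image> marks where an image is attended to; <|endofchunk|> closes each example.
prompt = (
    "You are a helpful medical assistant.\n"
    "<image>Question: What abnormality is shown? Answer: Pleural effusion.<|endofchunk|>"
    "<image>Question: What abnormality is shown? Answer: Cardiomegaly.<|endofchunk|>"
    "<image>Question: What abnormality is shown? Answer:"
)
lang_x = tokenizer([prompt], return_tensors="pt")

generated = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=20,
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

The key point is the interleaving: each in-context example pairs an image with a question-answer pair, and the model conditions on all of them before answering the final query, with no gradient updates.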
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | Chest X-ray VQA (test) | Overall Accuracy: 43.64 | 43 |
| Medical Visual Question Answering | SLAKE (test) | -- | 29 |
| Medical Image Classification | Chest X-Ray (test) | Accuracy: 50.1 | 16 |
| Medical Diagnosis | MAU (test) | DL Score: 27 | 13 |
| Medical Visual Question Answering | VQA-RAD (test) | Accuracy: 55.8 | 13 |
| Medical Visual Question Answering | PMC-VQA (test) | Accuracy: 34.7 | 13 |
| Medical Visual Question Answering | PathVQA (test) | Accuracy: 40.7 | 13 |
| Medical Visual Question Answering | MMMU Health & Medicine (test) | Accuracy: 47.5 | 12 |
| Multi-image Medical Visual Question Answering | Med-MIM (Held-in) | Temporal (C): 39.65 | 10 |
| Multi-image Medical Visual Question Answering | MIM-ODIR (Held-out) | VQA Close Accuracy (C): 16 | 10 |