HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
About
The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, with marked improvements on benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM, HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMMU | Accuracy | 50.59 | 275 |
| Medical Question Answering | MedMCQA | Accuracy | 63.6 | 253 |
| Medical Visual Question Answering | Slake | Accuracy | 78.85 | 134 |
| Medical Visual Question Answering | VQA-RAD | Accuracy | 74.26 | 106 |
| Medical Visual Question Answering | PathVQA | Overall Accuracy | 66.72 | 86 |
| Question Answering | MedQA | Accuracy | 57.4 | 70 |
| Medical Visual Question Answering | SLAKE closed-end | Accuracy | 74.02 | 54 |
| Medical Visual Question Answering | PMC-VQA | Accuracy | 53 | 44 |
| Visual Question Answering | Chest X-ray VQA (test) | Overall Accuracy | 61.96 | 43 |
| Visual Question Answering | SlideBench-VQA TCGA | Microscopy Score | 58.64 | 32 |