HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

About

The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.

Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, Benyou Wang • 2024

Related benchmarks

Task                               Dataset                 Result                  Rank
Multimodal Understanding           MMMU                    Accuracy 50.59          275
Medical Question Answering         MedMCQA                 Accuracy 63.6           253
Medical Visual Question Answering  Slake                   Accuracy 78.85          134
Medical Visual Question Answering  VQA-RAD                 Accuracy 74.26          106
Medical Visual Question Answering  PathVQA                 Overall Accuracy 66.72  86
Question Answering                 MedQA                   Accuracy 57.4           70
Medical Visual Question Answering  SLAKE closed-end        Accuracy 74.02          54
Medical Visual Question Answering  PMC-VQA                 Accuracy 53             44
Visual Question Answering          Chest X-ray VQA (test)  Overall Accuracy 61.96  43
Visual Question Answering          SlideBench-VQA TCGA     Microscopy Score 58.64  32
Showing 10 of 83 rows