GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

About

Despite significant advancements in general AI, its effectiveness in the medical domain is limited by the lack of specialized medical knowledge. To address this, we formulate GMAI-VL-5.5M, a multimodal medical dataset created by converting hundreds of specialized medical datasets with various annotations into high-quality image-text pairs. This dataset offers comprehensive task coverage, diverse modalities, and rich image-text data. Building upon this dataset, we develop GMAI-VL, a general medical vision-language model, with a three-stage training strategy that enhances the integration of visual and textual information. This approach significantly improves the model's ability to process multimodal data, supporting accurate diagnoses and clinical decision-making. Experiments show that GMAI-VL achieves state-of-the-art performance across various multimodal medical tasks, including visual question answering and medical image diagnosis.

Tianbin Li, Yanzhou Su, Wei Li, Bin Fu, Zhe Chen, Ziyan Huang, Guoan Wang, Chenglong Ma, Ying Chen, Ming Hu, Yanjun Li, Pengcheng Chen, Xiaowei Hu, Zhongying Deng, Yuanfeng Ji, Jin Ye, Yu Qiao, Junjun He • 2024

Related benchmarks

Task	Dataset	Result
Medical Visual Question Answering	VQA-RAD	Accuracy 66.3	228
Medical Report Generation	MIMIC-CXR (test)	ROUGE-L 0.1415	100
Medical Visual Question Answering	PathVQA	Accuracy 39.8	80
Medical Visual Question Answering	SLAKE (test)	--	67
Medical Visual Question Answering	PathVQA (test)	Accuracy 47.2	55
Medical Visual Question Answering	VQA-RAD (test)	--	50
Medical Visual Question Answering	OmniMedVQA	Accuracy 88.5	48
Medical Visual Question Answering	MMMU Health & Medicine (test)	Accuracy 51.2	39
Medical Visual Question Answering	PMC-VQA (test)	Accuracy 52.3	36
Medical Visual Understanding	GMAI-MMBench	Accuracy 61.74	18

Showing 10 of 11 rows
