
A Refer-and-Ground Multimodal Large Language Model for Biomedicine

About

With the rapid development of multimodal large language models (MLLMs), especially their visual-chat capabilities through refer-and-ground functionality, their significance is increasingly recognized. However, the biomedical field currently exhibits a substantial gap in this area, primarily due to the absence of a dedicated refer-and-ground dataset for biomedical images. To address this challenge, we devised the Med-GRIT-270k dataset. It comprises 270k question-and-answer pairs and spans eight distinct medical imaging modalities. Most importantly, it is the first dataset dedicated to the biomedical domain that integrates refer-and-ground conversations. The key idea is to sample large-scale biomedical image-mask pairs from medical segmentation datasets and to generate the corresponding instruction data from text with ChatGPT. Building on this dataset and multi-task instruction learning, we further introduce a Refer-and-Ground Multimodal Large Language Model for Biomedicine (BiRD). Extensive experiments corroborate the efficacy of the Med-GRIT-270k dataset and the multimodal, fine-grained interactive capabilities of the BiRD model. This work holds significant reference value for the exploration and development of intelligent biomedical assistants.
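As an illustration of the key idea above, the sketch below shows one plausible way a refer-and-ground instruction pair could be derived from an image-mask pair: the segmentation mask is reduced to a bounding box, and a text-only prompt asks ChatGPT to write a grounded question-and-answer pair about that region. The function names, prompt wording, and normalized-coordinate format are assumptions for illustration only; the exact Med-GRIT-270k generation pipeline is not detailed on this page.

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray) -> tuple[int, int, int, int]:
    """Convert a binary segmentation mask to an (x1, y1, x2, y2) bounding box."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def build_instruction_prompt(organ: str, modality: str, bbox, image_size) -> str:
    """Compose a text-only prompt asking ChatGPT for a refer-and-ground QA pair.

    The region is passed as normalized coordinates in the text, since the
    language model never sees the pixels, only the textual description.
    (Hypothetical prompt; not the exact wording used for Med-GRIT-270k.)
    """
    w, h = image_size
    x1, y1, x2, y2 = bbox
    norm = [round(x1 / w, 3), round(y1 / h, 3), round(x2 / w, 3), round(y2 / h, 3)]
    return (
        f"You are given a {modality} image containing a {organ} "
        f"located at normalized box {norm}. "
        "Write a question-and-answer pair in which the question refers to this "
        "region and the answer grounds it with the same box coordinates."
    )

# Toy example: a synthetic liver mask in a 512x512 CT slice (values made up).
mask = np.zeros((512, 512), dtype=np.uint8)
mask[200:320, 150:300] = 1
print(build_instruction_prompt("liver", "CT", mask_to_bbox(mask), (512, 512)))
```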

Xiaoshuang Huang, Haifeng Huang, Lingdong Shen, Yehui Yang, Fangxin Shang, Junwei Liu, Jia Liu • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Medical Image Answering | Med-GRIT 30k (test) | mBMR | 52.17 | 5 |
| Referring Captioning | Med-GRIT 30k (test) | SPICE | 55.23 | 5 |
| Referring Object Classification | Med-GRIT 30k (test) | Recall | 65.33 | 5 |
| Visual Grounding | Med-GRIT 30k (test) | Recall@0.5 | 53.92 | 5 |
| Medical Image Answering | LLaVA-Med qa0.2k | mBMR | 10.55 | 2 |
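For the Visual Grounding row, Recall@0.5 is the usual grounding metric: the fraction of referred objects whose predicted box overlaps the reference box with an IoU of at least 0.5. The sketch below shows a common way to compute it, assuming one predicted box per query; the exact evaluation protocol behind the numbers above is not described on this page.

```python
def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(predictions, ground_truths, threshold=0.5) -> float:
    """Fraction of ground-truth boxes whose prediction reaches the IoU threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Toy example with two predicted boxes against two reference boxes.
preds = [(10, 10, 50, 50), (100, 100, 150, 160)]
gts   = [(12, 8, 48, 52), (0, 0, 40, 40)]
print(recall_at_iou(preds, gts))  # 0.5 -> one of the two boxes matched
```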

Other info

Code
