
MMedAgent: Learning to Use Medical Tools with Multi-modal Agent

About

Multi-Modal Large Language Models (MLLMs), despite their success, exhibit limited generality and often fall short of specialized models. Recently, LLM-based agents have been developed to address these challenges by selecting appropriate specialized models as tools based on user inputs. However, such advances have not been extensively explored in the medical domain. To bridge this gap, this paper introduces the first agent explicitly designed for the medical field, named Multi-modal Medical Agent (MMedAgent). We curate an instruction-tuning dataset comprising six medical tools solving seven tasks across five modalities, enabling the agent to choose the most suitable tool for a given task. Comprehensive experiments demonstrate that MMedAgent achieves superior performance across a variety of medical tasks compared to state-of-the-art open-source methods and even the closed-source model GPT-4o. Furthermore, MMedAgent can efficiently update and integrate new medical tools. Code and models are all available.
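The core idea of a tool-using agent, mapping a user request to one specialized model and invoking it, can be illustrated with a small registry-and-dispatch sketch. Everything here is hypothetical: `ToolRegistry`, `choose_tool`, and the tool names `Segmentation`, `Grounding`, and `VQA` are illustrative, and the keyword rule only stands in for the selection step that MMedAgent learns through instruction tuning. It is not the paper's tool interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical tool signature: (image path, text prompt) -> text result.
ToolFn = Callable[[str, str], str]


@dataclass
class ToolRegistry:
    """Maps tool names, as an agent might emit them, to callables."""
    tools: Dict[str, ToolFn] = field(default_factory=dict)

    def register(self, name: str, fn: ToolFn) -> None:
        # Integrating a new tool is one registry entry here; a real agent
        # would also need to learn when to call it.
        self.tools[name] = fn

    def dispatch(self, name: str, image: str, prompt: str) -> str:
        if name not in self.tools:
            raise KeyError(f"unknown tool: {name}")
        return self.tools[name](image, prompt)


def choose_tool(request: str) -> str:
    # Stand-in for the learned selection step: a naive keyword rule,
    # NOT how MMedAgent chooses tools.
    text = request.lower()
    if "segment" in text:
        return "Segmentation"
    if "locate" in text or "detect" in text:
        return "Grounding"
    return "VQA"


# Placeholder tools standing in for specialized medical models (hypothetical).
registry = ToolRegistry()
registry.register("Segmentation", lambda image, prompt: f"mask for '{prompt}' in {image}")
registry.register("Grounding", lambda image, prompt: f"boxes for '{prompt}' in {image}")
registry.register("VQA", lambda image, prompt: f"answer to '{prompt}' about {image}")

tool = choose_tool("Segment the liver in this CT scan")
print(registry.dispatch(tool, "chest_ct.png", "liver"))  # mask for 'liver' in chest_ct.png
```

Keeping selection and execution separate is what makes adding a tool cheap in this sketch: a new specialized model only needs a registry entry, while teaching the agent when to call it is a separate step.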

Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, Yixin Wang · 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Medical Visual Question Answering | Slake | Accuracy | 68.7 | 247
Medical Visual Question Answering | VQA-RAD | Accuracy | 66.1 | 228
Medical Visual Question Answering | PathVQA | Overall Accuracy | 59.47 | 92
Open-ended VQA | MMOral-OPG | Teeth Accuracy | 16 | 55
Medical Visual Question Answering | MedXpertQA | Accuracy | 22.6 | 44
Medical Visual Question Answering | MMMU Health & Medicine (test) | Accuracy | 58.9 | 39
Multimodal Dental Question Answering | MMOral-Uni | II-Loc | 1.1 | 32
Segmentation | BiomedParseData official (Dtest) | IoU | 36.13 | 13
Segmentation | MeCOVQA-G+ (test) | IoU | 26.54 | 13
Segmentation | Held-out in-house (test) | IoU | 27.39 | 13
Showing 10 of 15 rows
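The segmentation rows are scored by IoU (intersection over union): the number of pixels in both the predicted and reference masks divided by the number of pixels in either. Below is a minimal per-mask sketch, assuming binary masks stored as nested lists. The page does not say how scores are aggregated across images; that values such as 36.13 are on a 0 to 100 scale is an assumption, and the sketch does no averaging.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = foreground, 0 = background


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over Union of two equally shaped binary masks."""
    if len(pred) != len(target) or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")
    inter = union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += int(bool(p) and bool(t))  # pixel in both masks
            union += int(bool(p) or bool(t))   # pixel in at least one mask
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return inter / union


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target) * 100)  # 50.0, i.e. IoU 0.5 expressed as a percentage
```

Here two pixels overlap out of four covered by either mask, giving an IoU of 0.5.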
