HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
About
We present HealthGPT, a Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (LLMs). We achieve this through a novel heterogeneous low-rank adaptation (H-LoRA) technique, complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. To train HealthGPT effectively, we construct VL-Health, a comprehensive medical domain-specific dataset covering both comprehension and generation. Experimental results demonstrate the strong performance and scalability of HealthGPT on unified medical vision tasks. Our project can be accessed at https://github.com/DCDmllm/HealthGPT.
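To make the idea of adapting a frozen model with separate low-rank updates more concrete, here is a minimal sketch. It is not the H-LoRA method itself, whose details the abstract does not give. It shows standard LoRA applied to one linear layer, with one adapter pair per task chosen by a task label. The class name `TaskLoRALinear`, the task labels, and all dimensions are hypothetical and chosen only for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def zeros(rows, cols):
    """Return a rows x cols matrix of zeros."""
    return [[0.0] * cols for _ in range(rows)]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Return a rows x cols matrix with small uniform random entries."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TaskLoRALinear:
    """Hypothetical layer: a frozen weight W plus one low-rank update B @ A per task.

    This is plain LoRA with per-task adapters, an illustrative assumption,
    not the authors' H-LoRA implementation.
    """

    def __init__(self, d_in, d_out, rank, tasks, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.weight = rand_matrix(d_out, d_in, rng)  # frozen pre-trained weight
        self.scale = alpha / rank
        # Usual LoRA initialisation: A is random, B is zero, so every adapter
        # starts as a no-op and the layer initially equals the frozen layer.
        self.adapters = {
            task: {"A": rand_matrix(rank, d_in, rng), "B": zeros(d_out, rank)}
            for task in tasks
        }

    def forward(self, x, task):
        """Compute W x + scale * B_task (A_task x) for an input vector x."""
        column = [[value] for value in x]
        base = matmul(self.weight, column)
        adapter = self.adapters[task]
        delta = matmul(adapter["B"], matmul(adapter["A"], column))
        return [b[0] + self.scale * d[0] for b, d in zip(base, delta)]
```

In this sketch, only the small `A` and `B` matrices of the selected task would be trained, so updating the generation adapter leaves the comprehension path unchanged. That separation is the general motivation for keeping heterogeneous tasks in separate low-rank modules.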
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Medical Visual Question Answering | OmniMedVQA (test) | CT Accuracy | 70.3 | 29 |
| Multimodal Dental Image Analysis | MMOral-Uni 1.0 (test) | Loc Score | 18 | 28 |
| Open-ended VQA | MMOral-OPG | Teeth Accuracy | 30.64 | 12 |
| Text-only Question Answering | PMC-MI-Bench | BLEU@4 | 11.8 | 10 |
| Multi-choice Visual Question Answering | PMC-MI-Bench | Accuracy | 88 | 10 |
| Visual Question Answering | PMC-MI-Bench (test) | BLEU@4 | 9.3 | 10 |
| Visual Question Answering | PMC-MI-Bench single-image | BLEU@4 | 9.8 | 10 |
| Medical Visual Question Answering | MMMU Med | BMS Score | 50 | 10 |
| X-ray understanding | MIMIC-CXR uncertain as positive | Micro F1 (14 classes) | 25.5 | 9 |
| X-ray understanding | MIMIC-CXR uncertain as negative | Micro F1 (14 classes) | 24.2 | 8 |