PulseMind: A Multi-Modal Medical Model for Real-World Clinical Diagnosis
About
Recent advances in medical multi-modal models focus on specialized image analysis such as dermatology, pathology, or radiology. However, they do not fully capture the complexity of real-world clinical diagnostics, which involve heterogeneous inputs and require ongoing contextual understanding during patient-physician interactions. To bridge this gap, we introduce PulseMind, a new family of multi-modal diagnostic models that integrates a systematically curated dataset, a comprehensive evaluation benchmark, and a tailored training framework. Specifically, we first construct a diagnostic dataset, MediScope, which comprises 98,000 real-world multi-turn consultations and 601,500 medical images, spanning over 10 major clinical departments and more than 200 sub-specialties. Then, to better reflect the requirements of real-world clinical diagnosis, we develop the PulseMind Benchmark, a multi-turn diagnostic consultation benchmark with a four-dimensional evaluation protocol comprising proactiveness, accuracy, usefulness, and language quality. Finally, we design a training framework tailored for multi-modal clinical diagnostics, centered around a core component named Comparison-based Reinforcement Policy Optimization (CRPO). Compared to absolute score rewards, CRPO uses relative preference signals from multi-dimensional comparisons to provide stable and human-aligned training guidance. Extensive experiments demonstrate that PulseMind achieves competitive performance on both the diagnostic consultation benchmark and public medical benchmarks.
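The abstract does not specify how CRPO turns comparisons into a training signal, so the following is only an illustrative sketch of the general idea of a relative, comparison-based reward, not the authors' algorithm. It assumes a group of candidate responses to the same consultation turn and a hypothetical pairwise judge that, for each of the four evaluation dimensions, returns +1, -1, or 0. The names `relative_rewards`, `Judge`, and `toy_judge` are invented for this example.

```python
from itertools import combinations
from typing import Callable, Sequence

# The four evaluation dimensions named in the abstract.
DIMENSIONS = ("proactiveness", "accuracy", "usefulness", "language_quality")

# Hypothetical judge signature: (response_a, response_b, dimension) -> +1, 0, or -1.
# +1 means response_a is preferred on that dimension, -1 means response_b is.
Judge = Callable[[str, str, str], int]


def relative_rewards(responses: Sequence[str], judge: Judge) -> list[float]:
    """Score each response by its net pairwise wins, normalized to [-1, 1].

    Assumption: every pair is compared once per dimension and all
    dimensions are weighted equally. The source does not state this.
    """
    n = len(responses)
    if n < 2:
        return [0.0] * n
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        for dim in DIMENSIONS:
            verdict = judge(responses[i], responses[j], dim)
            totals[i] += verdict
            totals[j] -= verdict
    scale = (n - 1) * len(DIMENSIONS)
    return [t / scale for t in totals]


def toy_judge(a: str, b: str, dimension: str) -> int:
    # Stand-in for a real preference model: prefers the longer response.
    return (len(a) > len(b)) - (len(a) < len(b))


rewards = relative_rewards(["ok", "a longer reply", "mid one"], toy_judge)
print(rewards)
```

Because each comparison adds to one response exactly what it subtracts from the other, the rewards in a group sum to zero. The signal therefore depends only on how responses rank against each other, not on any absolute score scale.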
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Medical Question Answering | MedMCQA | Accuracy 71.3 | 253 |
| Medical Visual Question Answering | Slake | Accuracy 85.6 | 134 |
| Question Answering | MedQA | Accuracy 94.8 | 70 |
| Multi-modal Question Answering | MedXpertQA-MM | Accuracy 36.7 | 27 |
| Multi-modal Question Answering | MMMU Health & Medicine | Accuracy 0.694 | 12 |
| Multi-modal Question Answering | VQA-RAD | Accuracy 87.1 | 12 |
| Multi-modal Question Answering | PMC-VQA | Accuracy 70.3 | 12 |
| Multi-modal Question Answering | PathVQA | Accuracy 64.9 | 12 |
| Multi-modal Question Answering | DermaVQA | Accuracy 42 | 12 |
| Text-only Question Answering | MedXpertQA text | Accuracy 29.8 | 12 |