PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology
About
The emergence of large multimodal models has unlocked remarkable potential in AI, particularly in pathology. However, the lack of specialized, high-quality benchmarks has impeded their development and precise evaluation. To address this, we introduce PathMMU, the largest and highest-quality expert-validated pathology benchmark for Large Multimodal Models (LMMs). It comprises 33,428 multimodal multiple-choice questions and 24,067 images from various sources, and each question is accompanied by an explanation for the correct answer. The construction of PathMMU harnesses GPT-4V's advanced capabilities, using over 30,000 image-caption pairs to enrich captions and generate corresponding Q&As in a cascading process. To maximize PathMMU's authority, we invite seven pathologists to scrutinize each question in PathMMU's validation and test sets under strict standards, while simultaneously setting an expert-level performance benchmark for PathMMU. We conduct extensive evaluations, including zero-shot assessments of 14 open-source and 4 closed-source LMMs and of their robustness to image corruption. We also fine-tune representative LMMs to assess their adaptability to PathMMU. The empirical findings indicate that advanced LMMs struggle with the challenging PathMMU benchmark: the top-performing LMM, GPT-4V, achieves a zero-shot performance of only 49.8%, significantly lower than the 71.8% demonstrated by human pathologists. After fine-tuning, significantly smaller open-source LMMs can outperform GPT-4V but still fall short of the expertise shown by pathologists. We hope that PathMMU will offer valuable insights and foster the development of more specialized, next-generation LMMs for pathology.
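As a rough illustration of how a multiple-choice benchmark of this kind is typically scored, the sketch below computes accuracy as the percentage of questions whose predicted option letter matches the reference answer. The `MCQuestion` record, its field names, and the placeholder questions are assumptions made for illustration; they are not PathMMU's actual data schema or evaluation code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MCQuestion:
    """Hypothetical record for one multiple-choice question (not the real schema)."""
    question: str
    options: List[str]
    answer: str        # letter of the correct option, e.g. "B"
    explanation: str   # rationale for the correct answer


def accuracy(questions: Sequence[MCQuestion], predictions: Sequence[str]) -> float:
    """Return the percentage of predictions that match the reference answers."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    if not questions:
        return 0.0
    # Normalize predicted letters so " b" and "B" count as the same choice.
    correct = sum(
        q.answer.strip().upper() == p.strip().upper()
        for q, p in zip(questions, predictions)
    )
    return 100.0 * correct / len(questions)


# Placeholder questions, purely illustrative.
sample = [
    MCQuestion("Placeholder question 1", ["A) ...", "B) ...", "C) ..."], "A", "..."),
    MCQuestion("Placeholder question 2", ["A) ...", "B) ...", "C) ..."], "C", "..."),
    MCQuestion("Placeholder question 3", ["A) ...", "B) ...", "C) ..."], "B", "..."),
    MCQuestion("Placeholder question 4", ["A) ...", "B) ...", "C) ..."], "B", "..."),
]
sample_predictions = ["A", "c", "A", " B"]
sample_accuracy = accuracy(sample, sample_predictions)
```

Under this scoring, the figures quoted above (49.8% for GPT-4V, 71.8% for human pathologists) would be read as the share of questions answered correctly, assuming the benchmark reports plain accuracy.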
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multiple-Choice Questions | PathMMU PathCLS n=177 (test-tiny) | Accuracy 78.9 | 13 |
| Multiple-choice Visual Question Answering | PathMMU SocialPath tiny n=229 (test) | Accuracy 71.5 | 13 |
| Visual Question Answering | PathMMU All-tiny (test) | Accuracy 71.8 | 13 |
| Multiple-choice Visual Question Answering | PathMMU PubMed (test-tiny) | Accuracy 72.9 | 13 |
| Multiple-choice Question Answering | PathMMU EduContent n=255 (test-tiny) | Accuracy 69 | 13 |
| Multiple-choice Question Answering | PathMMU Atlas tiny (test) | Accuracy 68.3 | 13 |