Multimodal LLM With Hierarchical Mixture-of-Experts for VQA on 3D Brain MRI
About
Multiparametric 3D brain MRI (mpMRI) is central to neuroradiology, but characterizing tumor location, appearance, size, and involvement of critical structures for neurosurgical planning remains challenging. We introduce mpLLM, a multimodal LLM for visual question answering (VQA) on mpMRI that produces clinically interpretable tumor descriptors (e.g., volume, morphology, extent, and coarse localization) as an adjunct to clinical expertise for referring neurosurgeons. mpLLM uses a prompt-conditioned hierarchical mixture-of-experts (MoE) to fuse multiple 3D sequences via routing over modality- and token-level projection experts, enabling data-efficient end-to-end training without large-scale image-report pretraining. To address limited paired image-text supervision, we propose a synthetic VQA protocol that derives clinically grounded questions and answers from expert segmentation annotations and is validated in collaboration with radiologists. Across multiple mpMRI datasets, mpLLM improves over strong medical VLM baselines by +5.5 points on average (+9.1% relative) and increases radiologist-rated clinical acceptability by +15.9 points (+46.6% relative). Our study makes three main contributions: (1) the first VQA dataset for 3D brain mpMRI, (2) a hierarchical MoE architecture for joint reasoning over interrelated 3D sequences, and (3) expert-supported evidence of clinical utility. Source code is available at https://github.com/arvindmvepa/mpllm, and we will release the dataset upon publication.
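To make the fusion idea concrete, below is a minimal NumPy sketch of prompt-conditioned hierarchical MoE routing: each modality's tokens are routed over a pool of projection experts (token level), and the resulting per-modality features are weighted by a prompt-conditioned modality-level router. This is an illustrative assumption-laden sketch, not the paper's implementation; all shapes, router parametrizations, and names (`W_tok`, `W_mod`, `fuse`) are invented for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, T, E = 16, 8, 4                       # embed dim, tokens per sequence, experts (illustrative)
modalities = ["t1", "t1ce", "t2", "flair"]

# Pool of token-level projection experts, plus two routers conditioned on the prompt.
experts = [rng.normal(scale=0.1, size=(d, d)) for _ in range(E)]
W_tok = rng.normal(scale=0.1, size=(2 * d, E))   # token router: [token; prompt] -> expert logits
W_mod = rng.normal(scale=0.1, size=(2 * d, 1))   # modality router: [pooled; prompt] -> logit

def fuse(tokens_by_mod, prompt):
    """Prompt-conditioned hierarchical MoE fusion over 3D-sequence tokens (sketch)."""
    fused, mod_logits = [], []
    for name in modalities:
        tok = tokens_by_mod[name]                           # (T, d) tokens for one MRI sequence
        ctx = np.concatenate([tok, np.tile(prompt, (T, 1))], axis=1)   # (T, 2d)
        gate = softmax(ctx @ W_tok, axis=-1)                # (T, E) token-level routing weights
        proj = np.stack([tok @ Wk for Wk in experts], 1)    # (T, E, d) expert projections
        routed = (gate[..., None] * proj).sum(axis=1)       # (T, d) expert-weighted tokens
        pooled = np.concatenate([routed.mean(0), prompt])   # (2d,) summary for modality router
        mod_logits.append(pooled @ W_mod)
        fused.append(routed)
    w = softmax(np.concatenate(mod_logits))                 # modality-level routing weights
    return sum(wi * fi for wi, fi in zip(w, fused))         # (T, d) fused tokens

tokens = {m: rng.normal(size=(T, d)) for m in modalities}
prompt = rng.normal(size=d)
out = fuse(tokens, prompt)
print(out.shape)  # (8, 16)
```

In this sketch both routers see the prompt, so the same scans can be fused differently depending on the question being asked, which is the motivation for prompt conditioning.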
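The synthetic VQA protocol turns expert segmentation annotations into question-answer pairs. A toy sketch of one such derivation, computing tumor volume from a 3D mask, is shown below; the label value, voxel spacing, and question template are hypothetical placeholders, not the paper's actual templates.

```python
import numpy as np

VOXEL_MM3 = 1.0  # assumed 1x1x1 mm voxel spacing; real spacing comes from the image header

def volume_qa(seg, label=1):
    """Derive a volume question/answer pair from a 3D segmentation mask (sketch)."""
    voxels = int((seg == label).sum())
    volume_ml = voxels * VOXEL_MM3 / 1000.0     # mm^3 -> mL
    question = "What is the approximate volume of the tumor?"
    answer = f"Approximately {volume_ml:.1f} mL."
    return question, answer

seg = np.zeros((8, 8, 8), dtype=np.int64)
seg[2:6, 2:6, 2:6] = 1                          # toy 4x4x4 "tumor": 64 voxels
q, a = volume_qa(seg)
print(a)  # Approximately 0.1 mL.
```

Because the answers are computed directly from expert-drawn masks, the resulting QA pairs are clinically grounded without requiring paired radiology reports.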
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| VQA | GLI (test) | Volume Score | 62.5 | 6 |
| VQA | GOAT (test) | Volume Score | 63 | 6 |
| VQA | Met (test) | Volume Score | 65.7 | 6 |
| VQA | GLI (val) | Volume Score | 43 | 3 |
| Classification | GLI (primary gliomas vs. secondary metastatic lesions) (val) | Accuracy | 95.6 | 2 |
| Radiologist Acceptance | GLI (val) | Radiologist Acceptance Rate | 50 | 2 |