
Robust Pre-Training of Medical Vision-and-Language Models with Domain-Invariant Multi-Modal Masked Reconstruction

About

Medical vision-language models show strong potential for joint reasoning over medical images and clinical text, but their performance often degrades under domain shift caused by variations in imaging devices, acquisition protocols, and reporting styles. Existing multi-modal pre-training methods largely overlook robustness, treating it as a downstream adaptation problem. In this work, we propose Robust Multi-Modal Masked Reconstruction (Robust-MMR), a self-supervised pre-training framework that explicitly incorporates robustness objectives into masked vision-language learning. Robust-MMR integrates asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints to encourage domain-invariant representations. We evaluate Robust-MMR on multiple medical vision-language benchmarks, including medical visual question answering (VQA-RAD, SLAKE, VQA-2019), cross-domain image-text classification (MELINDA), and robust image-caption retrieval (ROCO). Robust-MMR achieves 78.9% cross-domain accuracy on VQA-RAD, outperforming the strongest baseline by 3.8 percentage points, and reaches 74.6% and 77.0% accuracy on SLAKE and VQA-2019, respectively. Under perturbed evaluation, Robust-MMR improves VQA-RAD accuracy from 69.1% to 75.6%. For image-text classification, cross-domain MELINDA accuracy increases from 70.3% to 75.2%, while retrieval experiments show a reduction in mean rank degradation from over 16 to 4.1 under perturbation. Qualitative results further demonstrate improved clinical reasoning for disease detection and structural abnormality assessment. These findings show that explicitly modeling robustness during pre-training leads to more reliable and transferable medical vision-language representations for real-world deployment.
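To make the combined objective concrete, here is a minimal numerical sketch of how a masked-reconstruction loss can be paired with a domain-consistency term that pulls embeddings of clean and perturbed views of the same sample together. This is an illustration only: the function names, the MSE reconstruction term, the cosine-based consistency term, and the weighting factor `lam` are assumptions, not the paper's actual formulation.

```python
import numpy as np

def cosine_consistency(z_clean, z_pert):
    """Mean (1 - cosine similarity) between clean and perturbed embeddings."""
    num = np.sum(z_clean * z_pert, axis=1)
    den = np.linalg.norm(z_clean, axis=1) * np.linalg.norm(z_pert, axis=1)
    return float(np.mean(1.0 - num / den))

def robust_mmr_loss(recon, target, z_clean, z_pert, lam=0.5):
    """Hypothetical combined objective: masked reconstruction + consistency.

    recon, target : reconstructed and ground-truth masked patches/tokens
    z_clean, z_pert : embeddings of the clean and perturbed views
    lam : weight on the domain-consistency regularizer (assumed value)
    """
    # masked-reconstruction term (MSE over the masked positions)
    recon_loss = float(np.mean((recon - target) ** 2))
    # domain-consistency term: penalize embedding drift under perturbation
    consistency = cosine_consistency(z_clean, z_pert)
    return recon_loss + lam * consistency
```

With identical clean and perturbed embeddings the consistency term vanishes and the loss reduces to the reconstruction error; embeddings that drift apart under perturbation add a penalty scaled by `lam`.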

Melika Filvantorkaman, Mohsen Piri • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Medical Visual Question Answering | SLAKE | Accuracy | 73.9 | 134
Medical Visual Question Answering | VQA-RAD | Accuracy | 80.4 | 106
Medical Visual Question Answering | VQA-RAD (in-domain) | Accuracy | 83.3 | 10
Medical Visual Question Answering | VQA-RAD (cross-domain) | Accuracy | 78.9 | 10
Medical Visual Question Answering | VQA 2019 (test) | Overall Accuracy | 76.8 | 7
Medical Visual Question Answering | SLAKE (cross-domain) | Accuracy | 74.6 | 6
Medical Visual Question Answering | VQA 2019 (cross-domain) | Accuracy | 77.0 | 6
Image-Caption Retrieval | ROCO | R@10 (Std) | 66.1 | 4
Image-Text Classification | MELINDA (standard) | Accuracy | 79.8 | 4
Image-Text Classification | MELINDA (cross-domain) | Accuracy | 75.2 | 4
