EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography

About

Echocardiography is the most widely used imaging modality in cardiology, yet its interpretation remains labor-intensive and inherently multimodal, requiring view recognition, quantitative measurements, qualitative assessments, and guideline-based reasoning. While recent vision-language models (VLMs) have achieved broad success in natural images and certain medical domains, their potential in echocardiography has been limited by the lack of large-scale, clinically grounded image-text datasets and the absence of measurement-based reasoning central to echo interpretation. We introduce EchoGround-MIMIC, the first measurement-grounded multimodal echocardiography dataset, comprising 19,065 image-text pairs from 1,572 patients with standardized views, structured measurements, measurement-grounded captions, and guideline-derived disease labels. Building on this resource, we propose EchoVLM, a vision-language model that incorporates two novel pretraining objectives: (i) a view-informed contrastive loss that encodes the view-dependent structure of echocardiographic imaging, and (ii) a negation-aware contrastive loss that distinguishes clinically critical negative from positive findings. Across five types of clinical applications with 36 tasks spanning multimodal disease classification, image-text retrieval, view classification, chamber segmentation, and landmark detection, EchoVLM achieves state-of-the-art performance (86.5% AUC in zero-shot disease classification and 95.1% accuracy in view classification). We demonstrate that clinically grounded multimodal pretraining yields transferable visual representations and establish EchoVLM as a foundation model for end-to-end echocardiography interpretation. We will release EchoGround-MIMIC and the data curation code, enabling reproducibility and further research in multimodal echocardiography interpretation.

Yuheng Li, Yue Zhang, Abdoul Aziz Amadou, Yuxiang Lai, Jike Zhong, Tiziano Passerini, Dorin Comaniciu, Puneet Sharma • 2025

Related benchmarks

Task	Dataset	Result
Cardiac ultrasound segmentation	CAMUS (test)	--	37
Segmentation	EchoNet-Dynamic (external)	Dice Coefficient 93.1	15
View Classification	Multi-vendor TTE dataset (downstream)	Accuracy 95.1	8
Disease Classification	EchoGround-MIMIC (test)	AUC 86.5	7
Image-Text Retrieval	EchoGround-MIMIC (test)	Recall@50.0298	7
Segmentation	EchoNet-Pediatric	DSC (A4C) 92.4	4
Landmark Detection	EchoNet-LVH (test)	IVS Average LE 4.15	3
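
The segmentation rows above report the Dice coefficient (also abbreviated DSC). For readers unfamiliar with the metric, the following is a minimal sketch of how it is commonly computed for a pair of binary masks. The function name, the nested-list mask format, and the empty-mask convention are illustrative assumptions, not the authors' evaluation code. The scores in the table appear to be percentages, so a value of 0.931 from this function would correspond to 93.1.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks.

    Masks are nested lists of 0/1 values with the same shape
    (an illustrative format, not taken from the paper).
    """
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    # Total foreground pixels across both masks.
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)

    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy 2x3 masks: 2 overlapping pixels, 3 foreground pixels in each mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = dice_coefficient(pred, target)  # 2 * 2 / (3 + 3)
print(f"Dice: {score:.3f} ({100 * score:.1f}%)")
```

In practice the metric is usually averaged over all test images, and sometimes per structure or per view, as the "DSC (A4C)" row suggests.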
