RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture

About

Vision-language pretraining has driven much of the recent progress in medical image representation learning, but this paradigm is constrained by the availability of paired image-text data and by the reporting bias of clinical narratives. We ask whether competitive radiology encoders can be learned without any language supervision. We introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture and pretrained on approximately 840K unlabeled chest X-ray images. The model learns to predict latent representations of masked target regions from a visible context region, an objective that differs from both image-text contrastive pretraining and DINO-style self-distillation by explicitly modelling conditional structure in representation space. We evaluate RadJEPA primarily on radiology report generation with a frozen Vicuna-7B decoder, and additionally substitute its encoder into four widely used vision-language backbones (MedLLaVA, Qwen-2.5, BLIP-2, and Phi-4). For completeness we also report disease classification and semantic segmentation results. Across two datasets and four metrics, RadJEPA matches or exceeds the strongest image-only and vision-language baselines while using a ViT-B/14 backbone at 224 x 224 resolution.
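
The latent-prediction objective described above can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the single-layer "encoders", the mean-pooled predictor, the EMA-updated target encoder (a common choice in JEPA-style methods, not stated in the abstract), and every function and parameter name below are assumptions made for illustration. The only figure taken from the abstract is the ViT-B/14 backbone at 224 x 224 resolution, which gives a 16 x 16 grid of patches.

```python
import numpy as np


def patch_grid(image_size=224, patch_size=14):
    """Patches per side for a ViT with square patches (224 / 14 = 16)."""
    return image_size // patch_size


def block_mask(grid, top, left, height, width):
    """Boolean mask over the flattened patch grid; True inside the target block."""
    mask = np.zeros((grid, grid), dtype=bool)
    mask[top:top + height, left:left + width] = True
    return mask.ravel()


def jepa_loss(patches, pos, target_mask, w_context, w_target, w_pred):
    """Mean squared error between predicted and actual target-patch latents.

    patches: (N, D_in) patch features; pos: (N, D) positional embeddings.
    All weights are hypothetical stand-ins for real transformer blocks.
    """
    context_mask = ~target_mask
    # Context encoder sees only the visible (unmasked) patches.
    ctx = np.tanh(patches[context_mask] @ w_context + pos[context_mask])
    # Target encoder sees the full image; only the masked latents are kept.
    tgt = np.tanh(patches @ w_target + pos)[target_mask]
    # Toy predictor: pooled context plus a positional query per target patch.
    pred = (ctx.mean(axis=0) + pos[target_mask]) @ w_pred
    # The loss lives in representation space, not pixel space.
    return float(np.mean((pred - tgt) ** 2))


def ema_update(w_target, w_context, momentum=0.996):
    """Assumed exponential-moving-average update for the target encoder."""
    return momentum * w_target + (1.0 - momentum) * w_context


rng = np.random.default_rng(0)
grid = patch_grid()                      # 16 patches per side
n, d_in, d = grid * grid, 32, 16         # 256 patches; toy feature sizes
patches = rng.normal(size=(n, d_in))
pos = rng.normal(size=(n, d)) * 0.1
w_context = rng.normal(size=(d_in, d)) * 0.1
w_target = w_context.copy()
w_pred = rng.normal(size=(d, d)) * 0.1

target_mask = block_mask(grid, top=4, left=4, height=6, width=6)
loss = jepa_loss(patches, pos, target_mask, w_context, w_target, w_pred)
w_target = ema_update(w_target, w_context)
print(f"{target_mask.sum()} target patches, latent loss = {loss:.4f}")
```

In a real training loop the loss gradient would update only the context encoder and predictor, while the target encoder would be held out of backpropagation; the sketch omits optimisation entirely and only shows how the masked-latent target and its prediction are formed.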

Anas Anwarul Haq Khan, Mariam Husain, Pratik Jalan, Kshitij Jadhav • 2026

Related benchmarks

Task | Dataset | Result | Rank
Radiology Report Generation | MIMIC-CXR Findings generation v1.5 (Vicuna-7B) | ROUGE-L 26.1 | 12
Radiology Report Generation | IU-Xray Findings generation v1.5 (Vicuna-7B) | ROUGE-L 28.4 | 12
Image Classification | VinDr-CXR (test) | AP (LO) 19.2 | 10
Image Classification | RSNA-Pneumonia 5,337 images (test) | AP 72.7 | 10