
Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment

About

Vision-language models have transformed multimodal representation learning, yet dominant contrastive approaches like CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning. We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization. NOVA aligns visual representations to a frozen, domain-specific text encoder by predicting text embeddings from augmented image views, while enforcing an isotropic Gaussian structure via Sketched Isotropic Gaussian Regularization (SIGReg). This eliminates the need for negative sampling, momentum encoders, or stop-gradients, reducing the training objective to a single hyperparameter. We evaluate NOVA on zero-shot chest X-ray classification using ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR. On zero-shot classification across three benchmark datasets, NOVA outperforms multiple standard baselines while exhibiting substantially more consistent training runs. Our results demonstrate that non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods.
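The abstract's two-part objective (predictive alignment to frozen text embeddings plus a SIGReg-style distributional penalty) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact form of SIGReg is assumed here to be moment matching of random 1-D projections against a standard Gaussian, and `lam` stands in for the single trade-off hyperparameter mentioned in the abstract.

```python
import numpy as np

def nova_loss(img_emb, txt_emb, lam=1.0, n_proj=64, rng=None):
    """Sketch of a NOVA-style objective (form assumed, not from the paper).

    img_emb: (B, D) predicted embeddings of augmented image views
    txt_emb: (B, D) targets from the frozen text encoder
    lam:     single trade-off hyperparameter between the two terms
    """
    rng = rng or np.random.default_rng(0)
    # 1) Predictive alignment: regress image embeddings onto the frozen
    #    text-encoder targets (no negatives, no stop-gradient needed).
    align = np.mean(np.sum((img_emb - txt_emb) ** 2, axis=1))
    # 2) SIGReg-style term (assumed instantiation): project embeddings onto
    #    random unit directions and penalize deviation of each 1-D marginal
    #    from a standard Gaussian via its first two moments.
    dirs = rng.standard_normal((img_emb.shape[1], n_proj))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = img_emb @ dirs                       # (B, n_proj) projections
    mean_pen = np.mean(proj.mean(axis=0) ** 2)  # marginal means -> 0
    var_pen = np.mean((proj.var(axis=0) - 1.0) ** 2)  # marginal vars -> 1
    return align + lam * (mean_pen + var_pen)
```

With matched embeddings drawn from a standard Gaussian, the alignment term vanishes and the regularizer is near zero, so the loss collapses to roughly the distributional penalty alone; perturbing the image embeddings raises the loss through the alignment term.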

Lukas Kuhn, Giuseppe Serra, Florian Buettner • 2026

Related benchmarks

Task                          Dataset        Result              Rank
Medical Image Classification  ChestX-ray14   Mean AUROC 0.7625   18
Chest X-ray classification    MIMIC-CXR      AUC 75.78           7
