Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence

About

Vision-language alignment is crucial for various downstream tasks such as cross-modal generation and retrieval. Previous multimodal approaches like CLIP use InfoNCE to maximize mutual information, primarily aligning pairwise samples across modalities while overlooking distributional differences. In addition, InfoNCE has an inherent conflict between alignment and uniformity in the multimodal setting, leading to suboptimal alignment with modality gaps. To overcome these limitations, we propose CS-Aligner, a novel framework that performs distributional vision-language alignment by integrating Cauchy-Schwarz (CS) divergence with mutual information. CS-Aligner captures both the global distribution information of each modality and the pairwise semantic relationships. We find that the CS divergence seamlessly addresses InfoNCE's alignment-uniformity conflict and plays a role complementary to InfoNCE, yielding tighter and more precise alignment. Moreover, by introducing distributional alignment, CS-Aligner can incorporate additional information from unpaired data and token-level representations, enabling flexible and fine-grained alignment in practice. Experiments on text-to-image generation and cross-modality retrieval tasks demonstrate the effectiveness of our method for vision-language alignment.
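
To make the idea of combining a distributional term with a pairwise term concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard kernel-based empirical estimator of the CS divergence, D_CS = log E[k(x,x')] + log E[k(y,y')] - 2 log E[k(x,y)], with a Gaussian kernel, and a symmetric InfoNCE loss over paired embeddings. The function names, the bandwidth `sigma`, the `temperature`, and the weighting `lam` are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math


def gaussian_kernel(u, v, sigma):
    """Gaussian kernel between two embedding vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y) from the two sets."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def cs_divergence(xs, ys, sigma=1.0):
    """Empirical CS divergence between two sets of embeddings.

    Pairing is not required: xs and ys may have different sizes.
    """
    return (math.log(mean_kernel(xs, xs, sigma))
            + math.log(mean_kernel(ys, ys, sigma))
            - 2.0 * math.log(mean_kernel(xs, ys, sigma)))


def _normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where row i's target is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(r - m) for r in row))
        loss += log_sum_exp - row[i]
    return loss / len(logits)


def info_nce(xs, ys, temperature=0.1):
    """Symmetric InfoNCE over paired embeddings (xs[i] matches ys[i])."""
    xs = [_normalize(x) for x in xs]
    ys = [_normalize(y) for y in ys]
    logits = [[sum(a * b for a, b in zip(x, y)) / temperature for y in ys]
              for x in xs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(transposed))


def combined_loss(xs, ys, lam=1.0, sigma=1.0, temperature=0.1):
    """Hypothetical combination: distributional term plus weighted pairwise term."""
    return cs_divergence(xs, ys, sigma) + lam * info_nce(xs, ys, temperature)


# Toy 2-D "image" and "text" embeddings, paired by index.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.9, 0.1], [0.1, 0.9]]
loss = combined_loss(image_emb, text_emb)
```

Because `cs_divergence` only compares the two sets as distributions, it can also be evaluated on unpaired batches of different sizes, which is the property the abstract relies on when it mentions unpaired data; `info_nce` still needs index-aligned pairs.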

Wenzhe Yin, Zehao Xiao, Pan Zhou, Shujian Yu, Jiayi Shen, Jan-Jakob Sonke, Efstratios Gavves • 2025

Related benchmarks

Task	Dataset	Result
Image Classification	EuroSAT	Accuracy 87.2	569
Image Classification	Flowers102	Accuracy 97.5	558
Image Classification	UCF101	Top-1 Acc 84	529
Image-to-Text Retrieval	Flickr30K 1K (test)	--	523
Image Classification	DTD	Accuracy 72.3	487
Image Classification	SUN397	Accuracy 76.2	450
Text-to-Image Retrieval	Flickr30K 1K (test)	--	436
Image Classification	StanfordCars	Accuracy 81.9	384
Image Classification	ImageNet	Top-1 Accuracy 72.9	366
Image Classification	OxfordPets	Accuracy 93	298

Showing 10 of 21 rows
