
Topological Alignment of Shared Vision-Language Embedding Space

About

Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have alleviated this gap, but they enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework that aligns embedding spaces under topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates persistence diagrams, with theoretical error bounds, using a graph sparsification strategy. We validate the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on CIFAR-100, and stronger multilingual retrieval performance on xFlickr&CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning. Code is available at https://github.com/junwon0/ToMCLIP.git.
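To make the idea concrete, the sketch below (not the authors' implementation) illustrates one way a topological alignment penalty could be computed between an English embedding batch and a translated batch: build Vietoris-Rips persistence diagrams for each batch and compare them with the bottleneck distance. The use of gudhi, the function names, and the choice of H1 diagrams and bottleneck distance are illustrative assumptions; the paper's actual loss, its differentiable formulation, and its graph-sparsification approximation are not reproduced here.

```python
# Minimal sketch of a topological alignment penalty between two embedding
# batches, assuming numpy and gudhi are installed. Illustrative only; not
# the ToMCLIP loss.
import numpy as np
import gudhi


def persistence_diagram_h1(points: np.ndarray, max_edge: float = 2.0) -> np.ndarray:
    """Return the H1 persistence diagram of a Vietoris-Rips filtration."""
    rips = gudhi.RipsComplex(points=points, max_edge_length=max_edge)
    tree = rips.create_simplex_tree(max_dimension=2)  # 2-simplices needed for H1
    tree.persistence()  # compute persistence before querying intervals
    return tree.persistence_intervals_in_dimension(1)


def topological_alignment_penalty(emb_en: np.ndarray, emb_xx: np.ndarray) -> float:
    """Bottleneck distance between the H1 diagrams of two embedding batches."""
    d_en = persistence_diagram_h1(emb_en)
    d_xx = persistence_diagram_h1(emb_xx)
    return gudhi.bottleneck_distance(d_en, d_xx)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in embeddings, L2-normalized as in CLIP-style models.
    emb_en = rng.normal(size=(128, 16))
    emb_en /= np.linalg.norm(emb_en, axis=1, keepdims=True)
    emb_xx = emb_en + 0.05 * rng.normal(size=emb_en.shape)  # perturbed "translated" batch
    emb_xx /= np.linalg.norm(emb_xx, axis=1, keepdims=True)
    print(topological_alignment_penalty(emb_en, emb_xx))
```

A smaller penalty here indicates that the two batches trace out similar topological structure (loops at similar scales), which is the kind of global geometric agreement the abstract contrasts with purely instance-level alignment.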

Junwon You, Dasol Kang, Jae-Hun Jung • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy (%) | 34.4 | 1163
Classification | CIFAR-100 Multilingual (13 languages) (test) | Top-1 Accuracy | 66.18 | 20
Image Retrieval | xFlickr&CO | Recall@1 | 62.98 | 10
Text Retrieval | xFlickr&CO | Recall@1 | 63.79 | 10
Multilingual Image Retrieval | xFlickr&CO Low resource (1% subset) | Recall@1 | 34.5 | 5
Multilingual Image Retrieval | xFlickr&CO Full data (2M samples) | Recall@1 | 50.85 | 5
Multilingual Text Retrieval | xFlickr&CO Low resource (1% subset) | Recall@1 | 40.29 | 5
Multilingual Text Retrieval | xFlickr&CO Full data (2M samples) | Recall@1 | 54.07 | 5
