Do Vision and Language Encoders Represent the World Similarly?

About

Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performance in their respective domains. This raises a central question: does an alignment exist between uni-modal vision and language encoders, since they fundamentally represent the same physical world? Analyzing the latent space structure of vision and language models on image-caption benchmarks using Centered Kernel Alignment (CKA), we find that the representation spaces of unaligned and aligned encoders are semantically similar. In the absence of statistical similarity in aligned encoders like CLIP, we show that a possible matching of unaligned encoders exists without any training. We frame this as a seeded graph-matching problem exploiting the semantic similarity between graphs and propose two methods: a Fast Quadratic Assignment Problem optimization, and a novel localized CKA metric-based matching/retrieval. We demonstrate the effectiveness of this approach on several downstream tasks, including cross-lingual and cross-domain caption matching and image classification. Code is available at github.com/mayug/0-shot-llm-vision.

Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Mohamed El Amine Seddik, Karttikeya Mangalam, Noel E. O'Connor • 2024
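
To make the CKA comparison in the abstract concrete, here is a minimal sketch of the linear-kernel variant of CKA between two sets of paired embeddings. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which kernel is used, the function name linear_cka is hypothetical, and the "image" and "text" features are synthetic random data standing in for real encoder outputs.

```python
import numpy as np


def linear_cka(x, y):
    """Linear-kernel CKA between two embedding matrices with paired rows.

    x has shape (n, d1) and y has shape (n, d2); row i of each matrix is
    assumed to describe the same sample (e.g. an image and its caption).
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must contain the same number of samples")

    # Center each feature dimension so the Gram matrices are centered.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins for encoder outputs (assumption: not real model features).
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(200, 32))

# A rotated, rescaled copy has the same pairwise structure as the original.
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
text_feats = 3.0 * image_feats @ rotation

# Shuffling the rows breaks the image-caption pairing.
shuffled_feats = text_feats[rng.permutation(200)]

aligned_score = linear_cka(image_feats, text_feats)
shuffled_score = linear_cka(image_feats, shuffled_feats)
print(f"paired: {aligned_score:.3f}  shuffled: {shuffled_score:.3f}")
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either space, so the paired score stays at its maximum even though the two feature sets live in different coordinate systems, while the shuffled score drops. This sensitivity to the row pairing is what makes a CKA-style score usable as an objective when searching for a matching between unaligned encoders.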

Related benchmarks

Task | Dataset | Result | Rank
Image Classification | ImageNet-100 | -- | 84
Image Matching | COCO (val) | Matching Accuracy: 51.3 | 48
Image Retrieval | COCO (val) | Recall@5: 66.7 | 28
Caption Matching and Retrieval | nocaps (val) | Matching Accuracy: 67.3 | 26
3D-Text Matching | Objaverse-LVIS 1.0 (test) | CLIP Matching Accuracy: 6.6 | 15
3D-Text Retrieval | Objaverse-LVIS 1.0 (test) | CLIP Top-5 Retrieval: 18 | 15
Caption Matching and Retrieval | COCO 2014 (val) | Matching Accuracy: 72.3 | 13
Image-to-Text Retrieval | COCO 27 (val) | Matching Accuracy: 72.8 | 13
Caption Matching | XTD-10 v1 (test) | Score (de): 39.6 | 4
Representation Alignment | XTD-10 v1 (test) | Alignment Score (DE): 62.7 | 2
Showing 10 of 11 rows
