Is CLIP ideal? No. Can we fix it? Yes!

About

Contrastive Language-Image Pre-Training (CLIP) is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. Recent works attempt to address its shortcomings with data-centric or algorithmic approaches. But what if the problem is more fundamental, and lies in the geometry of CLIP? Toward this end, we rigorously analyze CLIP's latent space properties, and prove that no CLIP-like joint embedding space exists which can correctly do any two of the following at the same time: 1. represent basic descriptions and image content, 2. represent attribute binding, 3. represent spatial location and relationships, 4. represent negation. Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which solves the fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. This method improves upon the performance of classical CLIP-like joint encoder models on a wide array of benchmarks. We share our code and data here for reproducibility: https://github.com/Raphoo/DCSM_Ideal_CLIP

Raphi Kang, Yue Song, Georgia Gkioxari, Pietro Perona • 2025
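
As a rough illustration of the idea (not the authors' implementation), a dense cosine similarity map can be read as the full patch-by-token matrix of cosine similarities between the pre-pooling image and text embeddings of a CLIP-like encoder, rather than the single pooled image-text score standard CLIP uses. The sketch below uses random stand-in tensors; the function name and shapes are illustrative assumptions, and reducing such a map to a scalar score is done differently in the paper and repository linked above.

```python
# Minimal sketch, assuming patch/token embeddings already projected to a shared space.
import torch
import torch.nn.functional as F

def dense_cosine_similarity_map(patch_emb: torch.Tensor,
                                token_emb: torch.Tensor) -> torch.Tensor:
    """patch_emb: (num_patches, d) image patch embeddings.
    token_emb: (num_tokens, d) text token embeddings.
    Returns a (num_patches, num_tokens) matrix of cosine similarities."""
    patches = F.normalize(patch_emb, dim=-1)  # unit-normalize each patch vector
    tokens = F.normalize(token_emb, dim=-1)   # unit-normalize each token vector
    return patches @ tokens.T                 # cosine similarity for every patch-token pair

# Example with random stand-in embeddings (real ones would come from a
# CLIP-like joint encoder's pre-pooling outputs).
patches = torch.randn(49, 512)   # e.g. a 7x7 ViT patch grid
tokens = torch.randn(12, 512)    # e.g. 12 text tokens
dcsm = dense_cosine_similarity_map(patches, tokens)
print(dcsm.shape)  # torch.Size([49, 12])
```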

Related benchmarks

Task                     Dataset         Accuracy   Rank
Image Classification     Flowers102      52.9       558
Image Classification     Caltech101      79.2       228
Negation Understanding   Neg-COCO MCQ    48.6       14
Spatial Reasoning        What’sUp        63.7       13
Attribute Binding        CLEVRbind       39.9       9
Attribute Binding        NCD             95.7       9
Negation Understanding   NBvoc           49         9
Spatial Reasoning        COCO 1&2obj     72.4       9
Spatial Reasoning        VG obj 1&2      67         9
Attribute Binding        VG attr         68.1       9

(Showing 10 of 12 rows.)
