
Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP

About

Recent research has suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics affected. For the theoretical argument, we demonstrate that the supposed degrees of freedom in image embedding distances do not exist. For the empirical measures, our findings reveal that they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2), indicating that the observed phenomena do not stem from a misalignment specific to the former. Experiments on the commonly studied intra-modal tasks, retrieval and few-shot classification, confirm that addressing task ambiguity, rather than the supposed misalignment, is key for best results.
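The inter-modal vs. intra-modal distinction at the heart of the hypothesis can be illustrated with a minimal sketch. The toy embeddings below stand in for real encoder outputs (CLIP, DINO, etc.); all names and dimensions are illustrative, and the only assumption is that both modalities are unit-normalized in a shared space, as in CLIP.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for encoder outputs; real CLIP/DINO embeddings would go here.
image_emb = rng.normal(size=(4, 8))   # 4 images, 8-dim embeddings
text_emb = rng.normal(size=(4, 8))    # 4 captions, same embedding space

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

image_emb = l2_normalize(image_emb)
text_emb = l2_normalize(text_emb)

# Inter-modal alignment: image-text cosine similarities,
# the quantity CLIP's contrastive loss directly shapes.
inter = image_emb @ text_emb.T

# Intra-modal alignment: image-image cosine similarities,
# the quantity the misalignment hypothesis claims is left uncontrolled.
intra = image_emb @ image_emb.T

print(inter.shape, intra.shape)
```

Image-only tasks such as image-to-image retrieval rank gallery items by the `intra` similarities, which is why the hypothesis predicts they would suffer under language-image training.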

Jonas Herzog, Yue Wang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Image Retrieval | ROxford | -- | 67 |
| Few-shot Image Classification | Average 11 datasets (test) | Average Accuracy (Few-shot): 81 | 47 |
| Image Retrieval | RParis | -- | 44 |
| Classification | 11-Dataset Few-shot Learning Suite (16-shot) | Average Accuracy (16-shot): 86.5 | 33 |
| Image-to-Image Retrieval | Cars | mAP: 80.2 | 30 |
| Image-to-Image Retrieval | Aircraft | mAP: 46.3 | 30 |
| Image-to-Image Retrieval | EuroSAT | mAP: 62.8 | 30 |
| Image-to-Image Retrieval | Caltech | mAP: 93.0 | 30 |
| Image-to-Image Retrieval | Pets | mAP: 68.5 | 30 |
| Image-to-Image Retrieval | Flowers | mAP: 93.2 | 30 |

Showing 10 of 18 rows
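The image-to-image retrieval rows report mAP (mean average precision). As a rough sketch of how that metric works: for each query, the gallery is ranked by similarity, average precision (AP) is the mean of the precision values at each rank where a relevant item appears, and mAP averages AP over all queries. The data below is toy data, not from the benchmarks above.

```python
import numpy as np

def average_precision(relevant, scores):
    """AP for one query: mean precision at each rank holding a relevant item."""
    order = np.argsort(-scores)          # rank gallery by similarity, descending
    rel = np.asarray(relevant)[order]
    if not rel.any():
        return 0.0
    hits = np.cumsum(rel)                # running count of relevant items found
    ranks = np.flatnonzero(rel) + 1      # 1-based ranks of the relevant items
    return float(np.mean(hits[ranks - 1] / ranks))

# Toy example: gallery of 5 items, items 0 and 3 relevant to the query.
relevant = [1, 0, 0, 1, 0]
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.1])  # query-to-gallery similarities
print(average_precision(relevant, scores))    # ≈ 0.833

# mAP would then be np.mean([average_precision(r, s) for r, s in queries]).
```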
