
Variational Adapter for Cross-modal Similarity Representation

About

The core of vision-language models lies in measuring cross-modal similarity within a unified representation space. However, most image-text matching and multi-class image classification datasets lack fine-grained cross-modal matching annotations, forcing the continuous similarity space into binary classification boundaries. This compression induces false negative samples and significantly impairs generalization on cross-modal tasks. While prior research has attempted to mitigate this by modeling intra-modal ambiguity, it often overlooks inherent annotation flaws, leading to suboptimal uncertainty allocation. To address these challenges, we propose a Variational Adapter for Cross-modal Similarity Representation (VACSR). This approach reformulates image-text matching under scarce fine-grained semantic annotations as a variational inference problem. It constructs a latent space for cross-modal similarity and uses regularization techniques to mitigate overfitting to binary annotations. Experiments on image-text retrieval, domain generalization, and base-to-novel generalization demonstrate the proposed method's effectiveness and robust generalization ability.
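To make the variational framing in the abstract more concrete, the sketch below treats the similarity of each image-text pair as a Gaussian latent variable and combines a binary matching loss with a KL regularizer. It is an illustration under stated assumptions, not the VACSR architecture: the function names, the scalar linear adapter (`w_mu`, `b_mu`, `w_logvar`, `b_logvar`), the standard-normal prior, and the weight `beta` are all hypothetical choices that do not come from the paper.

```python
import numpy as np


def cosine_similarity(img, txt):
    """Pairwise cosine similarity between rows of img (N, D) and txt (M, D)."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    return img @ txt.T


def variational_similarity_loss(sim, labels, w_mu, b_mu, w_logvar, b_logvar,
                                beta=0.1, rng=None):
    """Hypothetical loss: binary matching term plus KL regularizer.

    sim    : (N, M) deterministic similarities from a frozen encoder.
    labels : (N, M) binary match annotations (1 = annotated match).
    The adapter here is a scalar affine map, chosen only for brevity.
    """
    rng = np.random.default_rng(0) if rng is None else rng

    # Assumed Gaussian posterior over the latent similarity of each pair.
    mu = w_mu * sim + b_mu
    logvar = w_logvar * sim + b_logvar

    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).
    eps = rng.standard_normal(sim.shape)
    z = mu + np.exp(0.5 * logvar) * eps

    # Numerically stable binary cross-entropy with z used as a logit.
    bce = np.mean(np.maximum(z, 0.0) - z * labels + np.log1p(np.exp(-np.abs(z))))

    # Closed-form KL(N(mu, sigma^2) || N(0, 1)), averaged over pairs.
    kl = np.mean(0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar))

    return bce + beta * kl, bce, kl


# Toy usage with random features; diagonal pairs are the annotated matches.
rng = np.random.default_rng(0)
img = rng.standard_normal((4, 8))
txt = rng.standard_normal((4, 8))
sim = cosine_similarity(img, txt)
labels = np.eye(4)
total, bce, kl = variational_similarity_loss(
    sim, labels, w_mu=5.0, b_mu=0.0, w_logvar=0.0, b_logvar=-2.0, rng=rng
)
```

The KL term is what keeps the latent similarity from collapsing onto the binary labels in this sketch: an off-diagonal pair annotated as a non-match can retain some probability mass at higher similarity values rather than being pushed to a hard negative.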

WenZhang Wei, Zhipeng Gui, Dehua Peng, Tiandi Ye, Huayi Wu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| Image Classification | Average of 11 datasets (ImageNet, Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, UCF101), Base-to-Novel Generalization | Harmonic Mean (HM) | 80.37 | 68 |
| Image-to-Text Retrieval | COCO 5K (test) | R@1 | 72.2 | 57 |
| Text-to-Image Retrieval | COCO 5K (test) | R@1 | 54.5 | 53 |
| Text-to-Image Retrieval | ECCV Caption | R@1 | 92.2 | 22 |
| Image-to-Text Retrieval | Crisscrossed Captions (CxC) | R@1 | 73.3 | 20 |
| Image-to-Text Retrieval | ECCV Caption (test) | R@1 | 84.9 | 17 |
| Image Classification | ImageNet Out-of-Distribution Variants, Cross-domain | Top-1 Acc (ImageNet-V2) | 65.7 | 15 |
| Image-Text Retrieval | COCO 1K | R@1 | 76.4 | 15 |
| Text-to-Image Retrieval | Crisscrossed Captions (CxC) | R@1 | 56.3 | 15 |
| Image-Text Retrieval | COCO 5k | R@1 | 57.1 | 14 |
Showing 10 of 12 rows
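For readers unfamiliar with the metrics in the benchmark rows, the sketch below shows how R@K and the harmonic mean (HM) are commonly computed. It assumes the usual definitions: R@K is the percentage of queries with at least one ground-truth item among the top K retrieved, and HM is the harmonic mean of base-class and novel-class accuracy. The page itself does not spell these out, and the function names and toy numbers are hypothetical.

```python
import numpy as np


def recall_at_k(sim, positives, k=1):
    """Percentage of queries with at least one positive in the top-k results.

    sim       : (Q, G) array of query-to-gallery similarity scores.
    positives : list of Q sets, each holding the ground-truth gallery indices.
    """
    # Sort gallery items by descending score and keep the first k per query.
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = [len(set(row.tolist()) & pos) > 0 for row, pos in zip(topk, positives)]
    return 100.0 * float(np.mean(hits))


def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)


# Toy example: 3 queries, 3 gallery items (numbers are made up).
toy_sim = np.array([
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.7, 0.4],
])
toy_positives = [{0}, {1}, {1}]

r1 = recall_at_k(toy_sim, toy_positives, k=1)  # 2 of 3 queries hit at rank 1
hm = harmonic_mean(90.0, 60.0)                 # → 72.0
```

Because the harmonic mean is dominated by the smaller of its two inputs, a method cannot score well on HM by trading away novel-class accuracy for base-class accuracy.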
