Caption-Matching: A Multimodal Approach for Cross-Domain Image Retrieval

About

Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision that aims to match images across different visual domains such as sketches, paintings, and photographs. Existing CDIR methods rely either on supervised learning with labeled cross-domain correspondences or on training or fine-tuning on the target datasets, and they often struggle with substantial domain gaps and generalize poorly to unseen domains. This paper introduces a novel CDIR approach that incorporates textual context by leveraging publicly available pre-trained vision-language models. Our method, Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation, enabling effective cross-domain similarity computation without the need for labeled data or further training. We evaluate our method on standard CDIR benchmark datasets, demonstrating state-of-the-art performance in plug-and-play settings with consistent improvements on Office-Home and DomainNet over previous methods. We also demonstrate our method's effectiveness on a dataset of AI-generated images from Midjourney, showcasing its ability to handle complex, multi-domain queries.
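
The idea of ranking gallery images by comparing captions rather than pixels can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how captions are generated or compared, so the captions below are hand-written placeholders standing in for the output of a captioning model, and a simple bag-of-words cosine similarity stands in for whatever text or vision-language similarity the paper actually uses. The names bow, cosine, and rank_by_caption are hypothetical.

```python
import math
from collections import Counter


def bow(caption):
    """Lowercased bag-of-words counts (a toy stand-in for a text encoder)."""
    return Counter(caption.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_caption(query_caption, gallery_captions):
    """Return gallery indices sorted from most to least similar caption."""
    q = bow(query_caption)
    scores = [(cosine(q, bow(c)), i) for i, c in enumerate(gallery_captions)]
    # Sort by descending similarity; break ties by original index.
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Assumption: these captions were produced beforehand by some captioning
# model, one per image; the query is a sketch, the gallery mixes domains.
query = "a sketch of a bicycle leaning against a wall"
gallery = [
    "a photo of a dog running on grass",
    "a painting of a bicycle next to a brick wall",
    "a photo of a coffee mug on a desk",
]

ranking = rank_by_caption(query, gallery)
print(ranking[0])  # index of the best-matching gallery image
```

Because the comparison happens entirely in text, the sketch and the painting of a bicycle can match even though their visual styles differ, which is the sense in which captions act as a domain-agnostic intermediate representation.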

Lucas Iijima, Nikolaos Giakoumoglou, Tania Stathaki • 2024

Related benchmarks

Task	Dataset	Metric	Result	Rank
Cross-Domain Image Retrieval	Office-Home	Retrieval Accuracy (Ar -> Cl)	78.3	18
Cross-Domain Image Retrieval	DomainNet	C-S Score	96.8	15

