
Caption-Matching: A Multimodal Approach for Cross-Domain Image Retrieval

About

Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision, aiming to match images across different visual domains such as sketches, paintings, and photographs. Existing CDIR methods either rely on supervised learning with labeled cross-domain correspondences or require training or fine-tuning on target datasets, and they often struggle with substantial domain gaps and generalize poorly to unseen domains. This paper introduces a novel CDIR approach that incorporates textual context by leveraging publicly available pre-trained vision-language models. Our method, Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation, enabling effective cross-domain similarity computation without the need for labeled data or further training. We evaluate our method on standard CDIR benchmark datasets, demonstrating state-of-the-art performance in plug-and-play settings with consistent improvements on Office-Home and DomainNet over previous methods. We also demonstrate our method's effectiveness on a dataset of AI-generated images from Midjourney, showcasing its ability to handle complex, multi-domain queries.
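
The idea of comparing images through their captions can be illustrated with a small sketch. This is not the authors' implementation: it assumes the captions have already been produced by some captioning model, and it swaps in a toy bag-of-words cosine similarity where the paper relies on pre-trained vision-language models. The function names, the stop-word list, and the example captions are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real text encoder would not need one.
STOP_WORDS = {"a", "an", "the", "of", "on"}


def bow(caption):
    """Bag-of-words vector for a caption (toy stand-in for a text encoder)."""
    tokens = [t for t in caption.lower().split() if t not in STOP_WORDS]
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_gallery(query_caption, gallery_captions):
    """Return gallery image ids sorted by caption similarity to the query."""
    q = bow(query_caption)
    scores = {img_id: cosine(q, bow(c)) for img_id, c in gallery_captions.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Assumed captions: in practice these would come from a captioning model
# applied to images from different domains (photo, painting, sketch).
gallery = {
    "photo_001": "a photo of a red bicycle leaning against a wall",
    "painting_002": "a painting of a bowl of fruit on a table",
    "photo_003": "a photo of a dog running on a beach",
}
query = "a sketch of a bicycle leaning against a wall"

ranking = rank_gallery(query, gallery)
print(ranking[0])
```

Because the comparison happens entirely in text, the sketch query and the photo are matched on shared content words rather than on pixel-level appearance, which is the intuition behind using captions as a domain-agnostic intermediate representation.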

Lucas Iijima, Nikolaos Giakoumoglou, Tania Stathaki • 2024

Related benchmarks

Task | Dataset | Result | Rank
Cross-Domain Image Retrieval | Office-Home | Retrieval Accuracy (Ar -> Cl): 78.3 | 18
Cross-Domain Image Retrieval | DomainNet | C-S Score: 96.8 | 15
