
CART: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling

About

Cross-modal retrieval aims to retrieve instances that are semantically related to a query by modeling interactions across different modalities. Traditional solutions use a single-tower or dual-tower framework to explicitly compute a score between queries and candidates, which incurs high training cost and inference latency on large-scale data. Inspired by the remarkable performance and efficiency of generative models, we propose a generative cross-modal retrieval framework (CART) based on coarse-to-fine semantic modeling, which assigns an identifier to each candidate and treats generating the identifier as the retrieval target. Specifically, we explore an effective coarse-to-fine scheme, combining K-Means and RQ-VAE to discretize multimodal data into token sequences that support autoregressive generation. Further, since this formulation lacks explicit interaction between queries and candidates, we propose a feature fusion strategy to align their semantics. Extensive experiments demonstrate the effectiveness of the strategies in CART, which achieves excellent results in both retrieval performance and efficiency.
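The coarse-to-fine identifier scheme described above can be illustrated with a minimal sketch: a first K-Means pass assigns each candidate a coarse token, and further levels quantize the residuals, RQ-VAE-style, to append finer tokens. The function names, level counts, and the plain K-Means used in place of a learned RQ-VAE codebook are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Minimal K-Means (stand-in for a learned codebook): returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]  # random init
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):  # recompute centroids, keeping empty clusters as-is
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    return labels, centroids

def assign_semantic_ids(embeddings, n_coarse=8, n_fine=8, levels=2, seed=0):
    """Hypothetical coarse-to-fine identifier assignment.

    Level 0: K-Means over raw embeddings gives each candidate a coarse token.
    Levels 1..L: residual quantization refines the identifier one token per level.
    Returns an (N, levels + 1) array of integer tokens per candidate; the rows
    are the token sequences an autoregressive decoder would be trained to emit.
    """
    ids = np.empty((len(embeddings), levels + 1), dtype=int)
    labels, centroids = kmeans(embeddings, n_coarse, seed=seed)
    ids[:, 0] = labels
    residual = embeddings - centroids[labels]
    for lvl in range(1, levels + 1):
        labels, centroids = kmeans(residual, n_fine, seed=seed)
        ids[:, lvl] = labels
        residual = residual - centroids[labels]
    return ids

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))          # toy candidate embeddings
tokens = assign_semantic_ids(emb)
print(tokens.shape)                        # (100, 3): one coarse + two fine tokens
```

Because each level quantizes what the previous levels failed to capture, candidates sharing a coarse token are split apart by their fine tokens, giving the decoder a semantically structured target vocabulary.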

Minghui Fang, Shengpeng Ji, Jialong Zuo, Hai Huang, Yan Xia, Jieming Zhu, Xize Cheng, Xiaoda Yang, Wenrui Liu, Gang Wang, Zhenhua Dong, Zhou Zhao · 2024

Related benchmarks

Task                     Dataset             Metric    Result   Rank
Text-to-Image Retrieval  Flickr30K           R@1       81.78    460
Text-to-Image Retrieval  Flickr30k (test)    Recall@1  81.8     423
Text-to-Video Retrieval  MSR-VTT (test)      R@1       52.6     234
Text-to-Video Retrieval  MSVD (test)         R@1       63.6     204
Text-to-Image Retrieval  MS-COCO             R@5       77.53    79
Text-to-Image Retrieval  MS-COCO (test)      R@1       52.4     66
Cross-modal retrieval    Clotho (test)       R@1       46.4     29
Cross-modal retrieval    AudioCaps (test)    R@1       49.8     23
