Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval

About

Composed Image Retrieval (CIR) aims to retrieve target images that closely resemble a reference image while integrating user-specified textual modifications, thereby capturing user intent more precisely. Existing training-free zero-shot CIR (ZS-CIR) methods often employ a two-stage process: they first generate a caption for the reference image and then use Large Language Models for reasoning to obtain a target description. However, these methods suffer from missing critical visual details and limited reasoning capabilities, leading to suboptimal retrieval performance. To address these challenges, we propose a novel, training-free one-stage method, One-Stage Reflective Chain-of-Thought Reasoning for ZS-CIR (OSrCIR), which employs Multimodal Large Language Models to retain essential visual information in a single-stage reasoning process, eliminating the information loss seen in two-stage methods. Our Reflective Chain-of-Thought framework further improves interpretative accuracy by aligning manipulation intent with contextual cues from reference images. OSrCIR achieves performance gains of 1.80% to 6.44% over existing training-free methods across multiple tasks, setting new state-of-the-art results in ZS-CIR and enhancing its utility in vision-language applications. Our code will be available at https://github.com/Pter61/osrcir2024/.
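The contrast the abstract draws between two-stage and one-stage pipelines can be sketched in a few lines. This is only an illustrative outline, not the authors' implementation: `captioner`, `llm`, `mllm`, and `scorer` are hypothetical callables standing in for whatever captioning model, language model, multimodal model, and text-to-image similarity function a real system would use, and the final ranking step is an assumed generic similarity sort.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces (assumptions for illustration, not from the paper):
Captioner = Callable[[str], str]                 # image path -> caption
TextReasoner = Callable[[str, str], str]         # (caption, modification) -> target description
MultimodalReasoner = Callable[[str, str], str]   # (image path, modification) -> target description
Scorer = Callable[[str, str], float]             # (target description, candidate) -> similarity


def two_stage_description(image: str, modification: str,
                          captioner: Captioner, llm: TextReasoner) -> str:
    """Two-stage baseline: caption first, then reason over text only.

    Any visual detail the caption omits is unavailable to the reasoning step.
    """
    caption = captioner(image)
    return llm(caption, modification)


def one_stage_description(image: str, modification: str,
                          mllm: MultimodalReasoner) -> str:
    """One-stage variant: the multimodal model sees the image and the
    modification text together and produces the target description directly."""
    return mllm(image, modification)


def rank_candidates(description: str, candidates: Sequence[str],
                    scorer: Scorer) -> List[str]:
    """Order candidates by descending similarity to the target description."""
    return sorted(candidates, key=lambda c: scorer(description, c), reverse=True)
```

With stub callables, the two description functions differ only in whether the reasoning step receives a caption or the image itself; everything downstream of the target description is shared.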

Yuanmin Tang, Xiaoting Qin, Jue Zhang, Jing Yu, Gaopeng Gou, Gang Xiong, Qingwei Ling, Saravan Rajmohan, Dongmei Zhang, Qi Wu • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| Composed Image Retrieval | CIRR (test) | Recall@1: 37.59 | 580 |
| Composed Image Retrieval | FashionIQ (val) | Average Recall@10: 37.57 | 489 |
| Composed Image Retrieval | CIRCO (test) | mAP@10: 31.14 | 260 |
| Composed Image Retrieval (Image-Text to Image) | CIRR | -- | 93 |
| Composed Image Retrieval | Fashion-IQ | -- | 80 |
| Composed Image Retrieval | CIRCO | mAP@5: 25.62 | 76 |
| Composed Image Retrieval | FashionIQ Shirt | Recall@10: 38.65 | 45 |
| Composed Image Retrieval | GeneCIS (test) | Recall@1: 17.9 | 38 |
| Compositional Image Retrieval | GeneCIS (test) | Focus Attribute R@1: 22.7 | 31 |
| Composed Image Retrieval | Fashion-IQ Dress (val) | Recall@10: 33.02 | 25 |
Showing 10 of 12 rows
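
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how Recall@K and mAP@K are commonly computed for a single query. It is not taken from any benchmark's official evaluation code. The function names are hypothetical, and normalizing average precision by min(K, number of targets) is an assumption that individual benchmarks may define differently. The functions return fractions in [0, 1], whereas the table reports percentages.

```python
from typing import Hashable, Sequence, Set


def recall_at_k(ranked: Sequence[Hashable], targets: Set[Hashable], k: int) -> float:
    """Return 1.0 if any ground-truth target appears in the top-k results, else 0.0."""
    return 1.0 if any(item in targets for item in ranked[:k]) else 0.0


def average_precision_at_k(ranked: Sequence[Hashable], targets: Set[Hashable], k: int) -> float:
    """Average precision over the top-k results for one query.

    Assumed normalization: divide by min(k, number of targets).
    """
    if not targets:
        return 0.0
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked[:k], start=1):
        if item in targets:
            hits += 1
            total += hits / position  # precision at this rank
    return total / min(k, len(targets))


def mean_over_queries(scores: Sequence[float]) -> float:
    """Average a per-query metric over all queries (gives Recall@K or mAP@K)."""
    return sum(scores) / len(scores)
```

Averaging `recall_at_k` over all queries gives Recall@K, and averaging `average_precision_at_k` gives mAP@K; multiplying by 100 puts them on the scale used in the table.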
