Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval

About

Composed Image Retrieval (CIR) aims to retrieve target images that closely resemble a reference image while integrating user-specified textual modifications, thereby capturing user intent more precisely. Existing training-free zero-shot CIR (ZS-CIR) methods often employ a two-stage process: they first generate a caption for the reference image and then use Large Language Models for reasoning to obtain a target description. However, these methods suffer from missing critical visual details and limited reasoning capabilities, leading to suboptimal retrieval performance. To address these challenges, we propose a novel, training-free one-stage method, One-Stage Reflective Chain-of-Thought Reasoning for ZS-CIR (OSrCIR), which employs Multimodal Large Language Models to retain essential visual information in a single-stage reasoning process, eliminating the information loss seen in two-stage methods. Our Reflective Chain-of-Thought framework further improves interpretative accuracy by aligning manipulation intent with contextual cues from reference images. OSrCIR achieves performance gains of 1.80% to 6.44% over existing training-free methods across multiple tasks, setting new state-of-the-art results in ZS-CIR and enhancing its utility in vision-language applications. Our code will be available at https://github.com/Pter61/osrcir2024/.
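
The one-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `describe_target` is a hypothetical stand-in for the multimodal LLM call, `encode_text` is a toy bag-of-words encoder, and ranking gallery images by cosine similarity to the encoded target description is an assumption about the retrieval step, which the abstract does not spell out.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(
    reference_image: dict,
    modification_text: str,
    describe_target: Callable[[dict, str], str],
    encode_text: Callable[[str], Vector],
    gallery: Dict[str, Vector],
    top_k: int = 10,
) -> List[str]:
    # One stage: the model receives the image and the modification text
    # together, so no intermediate caption can drop visual details.
    target_description = describe_target(reference_image, modification_text)
    query = encode_text(target_description)
    # Assumed retrieval step: rank gallery embeddings by similarity to the query.
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)
    return ranked[:top_k]

# --- Toy stand-ins (hypothetical; a real system would call an MLLM and a
# --- vision-language encoder here).
VOCAB = ["dog", "cat", "beach", "snow"]

def toy_encode_text(text: str) -> Vector:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def toy_describe_target(reference_image: dict, modification_text: str) -> str:
    # Pretend the model keeps the subject and applies the requested scene change.
    return f"{reference_image['subject']} {modification_text.split()[-1]}"

gallery = {
    "dog_beach.jpg": [1.0, 0.0, 1.0, 0.0],
    "dog_snow.jpg": [1.0, 0.0, 0.0, 1.0],
    "cat_snow.jpg": [0.0, 1.0, 0.0, 1.0],
}

results = retrieve(
    {"subject": "dog"}, "move it to the snow",
    toy_describe_target, toy_encode_text, gallery, top_k=2,
)
```

In this toy setup the top result is `dog_snow.jpg`, because the generated description keeps the subject from the reference image and applies the requested change.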

Yuanmin Tang, Xiaoting Qin, Jue Zhang, Jing Yu, Gaopeng Gou, Gang Xiong, Qingwei Ling, Saravan Rajmohan, Dongmei Zhang, Qi Wu • 2024

Related benchmarks

Task	Dataset	Result
Composed Image Retrieval	CIRR (test)	Recall@1 37.59	786
Composed Image Retrieval	FashionIQ (val)	Average Recall@10 37.57	601
Composed Image Retrieval	CIRCO (test)	mAP@10 31.14	360
Composed Image Retrieval	Fashion-IQ (test)	Average Recall@10 0.371	176
Composed Image Retrieval	Fashion-IQ	Average Recall@50 57.15	129
Composed Image Retrieval (Image-Text to Image)	CIRR	Recall@1 37.59	128
Composed Image Retrieval	CIRCO	mAP@5 25.62	96
Composed Image Retrieval	FashionIQ Shirt	Recall@10 38.65	64
Composed Image Retrieval	FashionIQ (Dress)	Recall@10 33.02	39
Composed Image Retrieval	GeneCIS (test)	Recall@1 17.9	38

Showing 10 of 20 rows
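
For readers unfamiliar with the metrics in the table, the sketch below computes Recall@K and average precision at K for a single query. It is an illustration under stated assumptions: the function names are hypothetical, and the AP@K normalisation by `min(K, number of targets)` is one common convention that may differ from the exact evaluation script of each benchmark. Benchmark figures are averages of these per-query values over all queries, often reported as percentages.

```python
from typing import List, Set

def recall_at_k(ranked: List[str], targets: Set[str], k: int) -> float:
    """1.0 if any ground-truth target appears in the top-k results, else 0.0."""
    return 1.0 if any(item in targets for item in ranked[:k]) else 0.0

def average_precision_at_k(ranked: List[str], targets: Set[str], k: int) -> float:
    """Sum of precision values at each hit in the top-k, normalised by min(k, |targets|)."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in targets:
            hits += 1
            total += hits / rank  # precision at this rank
    denom = min(k, len(targets))
    return total / denom if denom else 0.0

# Toy query: two correct images, found at ranks 2 and 4.
ranked = ["a", "b", "c", "d"]
targets = {"b", "d"}

r1 = recall_at_k(ranked, targets, 1)               # 0.0: top-1 is not a target
r2 = recall_at_k(ranked, targets, 2)               # 1.0: "b" is in the top-2
ap4 = average_precision_at_k(ranked, targets, 4)   # (1/2 + 2/4) / 2 = 0.5
```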
