Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval

About

Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query that consists of a reference image and a modification text. The text specifies how to alter the reference image to form a "mental image", which CIR should then use to find the target image in the database. The fundamental challenge of CIR is that this "mental image" is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods that use a Large Multimodal Model (LMM) to generate a textual description for a given multimodal query, and then employ a Vision-Language Model (VLM) for textual-visual matching to search for the target image. In contrast, we address CIR from first principles by directly generating the "mental image" for more accurate matching. Specifically, we prompt an LMM to generate a "mental image" for a given multimodal query and propose to use this "mental image" to search for the target image. As the "mental image" has a synthetic-to-real domain gap with real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses an LMM to construct a "paracosm", in which it matches the multimodal query against the database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method. It significantly outperforms existing zero-shot methods on challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.
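The matching step described above can be illustrated with a minimal sketch. It assumes the generated "mental image" and the synthetic counterpart of each database image have already been encoded into feature vectors by some image encoder; the abstract does not specify the encoder or the scoring rule, so the cosine-similarity ranking and the names `cosine` and `rank_database` are hypothetical choices made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_database(mental_image_vec, synthetic_db_vecs):
    """Rank database image ids by similarity to the query's mental image.

    mental_image_vec: feature vector of the generated "mental image"
        (assumed to be precomputed by an image encoder).
    synthetic_db_vecs: dict mapping each database image id to the feature
        vector of its synthetic counterpart (also assumed precomputed).
    """
    scores = {
        image_id: cosine(mental_image_vec, vec)
        for image_id, vec in synthetic_db_vecs.items()
    }
    # Highest similarity first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy 3-dimensional features, invented purely for illustration.
query = [0.9, 0.1, 0.0]
database = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.5, 0.5, 0.0],
}
ranking = rank_database(query, database)
print(ranking)
```

Both sides of the comparison are synthetic images in this sketch, which is the point of building the "paracosm": the query and the database are compared within the same domain rather than across the synthetic-to-real gap.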

Tong Wang, Yunhan Zhao, Shu Kong • 2026

Related benchmarks

Task	Dataset	Metric	Result
Composed Image Retrieval	CIRR (test)	Recall@1	39.3	786
Composed Image Retrieval	FashionIQ (val)	Average Recall@10	38.74	601
Composed Image Retrieval	CIRCO (test)	mAP@10	40.86	360
Composed Image Retrieval	Fashion-IQ	--	--	129
Composed Image Retrieval (Image-Text to Image)	CIRR	Recall@1	39.3	128
Composed Image Retrieval	CIRCO	mAP@5	39.82	96
Composed Image Retrieval	GeneCIS (test)	Recall@1	17.6	38
Compositional Image Retrieval	GeneCIS (test)	Focus Attribute R@1	21.4	31
Composed Image Retrieval	GeneCIS	Focus Attribute R@1	21.4	5
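For readers unfamiliar with the Recall@K metric reported above, here is a minimal sketch of how it is commonly computed. It assumes one ground-truth target per query and reports the score as a percentage; the function name `recall_at_k` and the toy data are hypothetical, and benchmarks with several valid targets per query (or the mAP@K metric) need a different definition that is not covered here.

```python
def recall_at_k(rankings, targets, k):
    """Percentage of queries whose target appears in the top-k results.

    rankings: dict mapping query id to a list of retrieved image ids,
        ordered from best to worst match.
    targets: dict mapping query id to its single ground-truth image id.
    """
    hits = sum(
        1 for query_id, ranked in rankings.items()
        if targets[query_id] in ranked[:k]
    )
    return 100.0 * hits / len(rankings)


# Toy data, invented for illustration only.
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["b", "a", "c"],
    "q3": ["c", "b", "a"],
    "q4": ["b", "c", "a"],
}
targets = {"q1": "a", "q2": "a", "q3": "a", "q4": "a"}

r1 = recall_at_k(rankings, targets, 1)
r2 = recall_at_k(rankings, targets, 2)
r3 = recall_at_k(rankings, targets, 3)
print(r1, r2, r3)
```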

