XR: Cross-Modal Agents for Composed Image Retrieval
About
Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift: each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi-agent coordination, XR iteratively refines retrieval to meet both semantic and visual query constraints, achieving up to a 38% gain over strong training-free and training-based baselines on FashionIQ, CIRR, and CIRCO, while ablations show each agent is essential. Code is available at https://01yzzyu.github.io/xr.github.io/.
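The coarse-to-fine flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`imagine`, `coarse_filter`, `fine_filter`, `retrieve`) is hypothetical, plain cosine similarity stands in for the hybrid matching, and simple boolean predicates stand in for the question agents' targeted reasoning.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coarse_filter(target: Vector, gallery: Dict[str, Vector], k: int) -> List[str]:
    """Similarity stage (assumed): keep the k gallery items closest to the imagined target."""
    ranked = sorted(gallery, key=lambda name: cosine(target, gallery[name]), reverse=True)
    return ranked[:k]


def fine_filter(candidates: List[str], checks: Sequence[Callable[[str], bool]]) -> List[str]:
    """Question stage (assumed): re-rank by how many checks each candidate passes.

    sorted() is stable, so ties keep their coarse-stage order.
    """
    return sorted(candidates, key=lambda name: -sum(check(name) for check in checks))


def retrieve(imagine: Callable[[Vector, str], Vector], reference: Vector, modification: str,
             gallery: Dict[str, Vector], checks: Sequence[Callable[[str], bool]], k: int) -> List[str]:
    target = imagine(reference, modification)       # imagination stage
    shortlist = coarse_filter(target, gallery, k)   # coarse filtering
    return fine_filter(shortlist, checks)           # fine filtering


# Toy data: 2-D "embeddings" keyed by a descriptive name.
gallery = {"red_dress": [1.0, 0.0], "blue_dress": [0.0, 1.0], "blue_shirt": [0.1, 0.9]}
imagine = lambda ref, mod: [0.0, 1.0]               # stand-in for cross-modal generation
checks = [lambda name: "shirt" in name]             # stand-in for a verification question

result = retrieve(imagine, [1.0, 0.0], "make it a blue shirt", gallery, checks, k=2)
print(result)
```

In this toy run the coarse stage shortlists the two blue items, and the single check then moves `blue_shirt` ahead of `blue_dress`, which is the kind of correction the fine-filtering stage is meant to provide.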
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Composed Image Retrieval | FashionIQ (val) | Shirt Recall@10 38.91 | 455 |
| Composed Image Retrieval (Image-Text to Image) | CIRR | Recall@1 43.13 | 75 |
| Composed Image Retrieval | CIRCO | mAP@5 31.38 | 63 |
| Composed Image Retrieval | Fashion-IQ | -- | 40 |
| Composed Image Retrieval | FashionIQ Toptee | Recall@10 43.91 | 20 |
| Composed Image Retrieval | FashionIQ (Dress) | Recall@10 28.71 | 20 |
| Composed Image Retrieval | FashionIQ Shirt | Recall@10 38.91 | 20 |
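For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how Recall@K and average precision at K are commonly computed. The function names are illustrative, and the AP normalisation by `min(k, number of relevant items)` is an assumed convention; the official benchmark evaluation scripts are the authority on the exact definitions.

```python
from typing import List, Sequence, Set


def recall_at_k(rankings: Sequence[List[str]], targets: Sequence[str], k: int) -> float:
    """Percentage of queries whose single target appears in the top-k results."""
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return 100.0 * hits / len(rankings)


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """AP@k for one query with several relevant items, as a fraction in [0, 1].

    Assumed normalisation: divide by min(k, number of relevant items).
    """
    hits = 0
    score = 0.0
    for position, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / position  # precision at this rank
    denom = min(k, len(relevant))
    return score / denom if denom else 0.0


# Two toy queries: the first target is ranked 2nd, the second is ranked 3rd.
rankings = [["a", "b", "c"], ["x", "y", "z"]]
targets = ["b", "z"]
recall = recall_at_k(rankings, targets, k=2)

ap = average_precision_at_k(["a", "b", "c", "d"], {"b", "d"}, k=4)
print(recall, ap)
```

mAP@K is then the mean of `average_precision_at_k` over all queries, typically multiplied by 100 when reported as in the table.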