Resolving Ambiguity in Composed Image Retrieval via Calibrated Interaction
About
Composed image retrieval (CIR) searches a corpus using a reference image and text describing how to modify it. Despite rapid progress from triplet-trained compositors to zero-shot and generative methods, essentially all systems share one assumption: that a query maps to a single target, scored by Recall@K against one annotation. We argue this is fundamentally at odds with the task. A query such as "make it more formal" does not name an image but a region of the corpus, and which member the user intends is genuinely underdetermined. This underspecification is the root of the well-known false-negative problem and leaves current models unable to tell a precise query from an ambiguous one. We reframe CIR as calibrated intent resolution under uncertainty: a retriever is wrapped in a conformal prediction layer that returns a candidate set with a coverage guarantee, and the size of that set is a principled measure of ambiguity. When the set is large, an expected-information-gain policy asks the single most useful clarifying question, drawn from interpretable ambiguity axes, and the set contracts. We introduce AmbiCIR, a benchmark and human-validated user simulator that revive the dormant auxiliary and dialogue annotations of CIRR and extend the multiple-positive setting of CIRCO. Across open-domain and fashion benchmarks our method matches the single-turn state of the art, confirming that calibrated resolution is cost-free on precise queries, while reaching the intended target in a fraction of the interaction budget required by naive conversational baselines. It is also the first to report valid coverage and calibration for the task.
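The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses generic split conformal prediction and a noise-free expected-information-gain rule, and every name below (`conformal_threshold`, `candidate_set`, `best_question`) is hypothetical. It assumes the nonconformity score is `1 - similarity`, that calibration queries are exchangeable with test queries, and that each candidate's answer to each clarifying question is known and deterministic.

```python
import math
import numpy as np


def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold from calibration nonconformity scores.

    cal_scores[i] is the nonconformity (assumed here: 1 - similarity) of the
    annotated target for calibration query i. Sets built with the returned
    threshold contain the target with probability >= 1 - alpha, marginally,
    under exchangeability.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return float("inf")  # too few calibration points: keep everything
    return float(np.sort(cal_scores)[k - 1])


def candidate_set(similarities, qhat):
    """Indices of corpus items whose nonconformity is within the threshold."""
    return np.flatnonzero(1.0 - similarities <= qhat)


def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def best_question(probs, answers):
    """Pick the question with the highest expected information gain.

    probs: (C,) belief over the C candidates in the current set.
    answers: (C, Q) integer answer each candidate would give to question q.
    With deterministic answers, the expected information gain of a question
    equals the entropy of its answer distribution.
    """
    probs = probs / probs.sum()
    gains = []
    for q in range(answers.shape[1]):
        values = np.unique(answers[:, q])
        answer_dist = np.array([probs[answers[:, q] == v].sum() for v in values])
        gains.append(entropy_bits(answer_dist))
    return int(np.argmax(gains)), gains
```

In this sketch a precise query yields a small set and needs no question, while an ambiguous query yields a large set, at which point `best_question` chooses the axis that splits the remaining candidates most evenly.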
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Composed Image Retrieval | CIRR (test) | Recall@1 | 30.1 | 786 |
| Composed Image Retrieval | CIRCO (test) | mAP@10 | 24.8 | 360 |
| Composed Image Retrieval | Fashion-IQ (test) | Average Recall@10 | 0.385 | 176 |
| Composed Image Retrieval (Image-Text to Image) | CIRR | Recall@1 | 91.3 | 128 |
| Composed Image Retrieval | CIRCO | mAP@5 | 36 | 96 |
| Composed Image Retrieval (Image-Text to Image) | FashionIQ | Recall@10 | 54.6 | 39 |
| Composed Image Retrieval | CIRR Subset (test) | R@1 | 58.4 | 33 |