GENIUS: A Generative Framework for Universal Multimodal Search
About
Generative retrieval is an emerging approach in information retrieval that generates identifiers (IDs) of target data based on a query, providing an efficient alternative to traditional embedding-based retrieval methods. However, existing models are task-specific and fall short of embedding-based retrieval in performance. This paper proposes GENIUS, a universal generative retrieval framework supporting diverse tasks across multiple modalities and domains. At its core, GENIUS introduces modality-decoupled semantic quantization, which transforms multimodal data into discrete IDs that encode both modality and semantics. Moreover, to enhance generalization, we propose a query augmentation strategy that interpolates between a query and its target, allowing GENIUS to adapt to varied query forms. Evaluated on the M-BEIR benchmark, GENIUS surpasses prior generative methods by a clear margin. Unlike embedding-based retrieval, GENIUS consistently maintains high retrieval speed regardless of database size, with competitive performance across multiple benchmarks. With additional re-ranking, GENIUS often achieves results close to those of embedding-based methods while preserving efficiency.
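The query augmentation described in the abstract can be illustrated with a minimal sketch. It assumes that the query and its target are represented as embedding vectors of equal dimension and that the interpolation is linear with a weight drawn uniformly from [0, 1]; the abstract specifies neither, and the function names `interpolate_query` and `augment_query` are hypothetical, not taken from the GENIUS code.

```python
import random


def interpolate_query(query_emb, target_emb, alpha):
    """Return (1 - alpha) * query + alpha * target, element-wise.

    Assumption: linear interpolation between two embeddings of equal size.
    """
    if len(query_emb) != len(target_emb):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * q + alpha * t for q, t in zip(query_emb, target_emb)]


def augment_query(query_emb, target_emb, rng=random):
    """Produce one augmented query and the mixing weight that was used.

    Assumption: alpha is sampled uniformly; the paper may sample differently.
    """
    alpha = rng.random()
    return interpolate_query(query_emb, target_emb, alpha), alpha


if __name__ == "__main__":
    query = [0.0, 2.0]
    target = [4.0, 6.0]
    augmented, alpha = augment_query(query, target, random.Random(0))
    print(alpha, augmented)
```

With `alpha = 0` the augmented query equals the original query, and as `alpha` approaches 1 it moves toward the target, so training on sampled values exposes the model to a range of query forms for the same target ID.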
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 74.1 | 423 |
| Image-to-Text Retrieval | MSCOCO | -- | 124 |
| Text-to-Image Retrieval | MSCOCO | -- | 118 |
| Text-to-Image Retrieval | MS-COCO | R@5 78 | 79 |
| Composed Image Retrieval (Image-Text to Image) | CIRR | -- | 75 |
| Text-to-Image Retrieval | MS-COCO (test) | R@1 46.1 | 66 |
| Image-to-Text Retrieval | MS-COCO | R@5 91.1 | 65 |
| Image-text-to-text retrieval | InfoSeek | Recall@5 20.7 | 20 |
| Multi-modal retrieval (Text to Text/Image-Text) | WebQA | Recall@5 60.6 | 19 |
| Composed Image Retrieval (Image-Text to Image) | FashionIQ | Recall@10 19.2 | 19 |