SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
About
Multimodal Large Language Models (MLLMs) possess intrinsic reasoning and world-knowledge capabilities, yet adapting them for dense retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. To address this, we propose SLQ, a parameter-efficient tuning framework that adapts MLLMs for retrieval while keeping the backbone entirely frozen. SLQ introduces a small set of Shared Latent Queries that are appended to both text and image tokens, leveraging the model's native causal attention to aggregate multimodal context into a unified embedding space. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench, validating that preserving the pre-trained representations via non-invasive adaptation is an effective strategy for MLLM-based retrieval. The code is available at https://github.com/CnFaker/SLQ.
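The core idea in the abstract, appending the same latent queries after the tokens of either modality and letting causal attention pull context into them, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a single attention layer with identity projections as a stand-in for the frozen backbone, and it assumes the embedding is read out by mean-pooling the query positions and L2-normalizing, a detail the abstract does not specify. All names (`causal_attention`, `embed`, `rand_tokens`) are hypothetical.

```python
import math
import random


def causal_attention(x):
    """Toy single-head causal self-attention with identity projections.

    Stands in for one frozen layer: position i attends only to positions 0..i.
    """
    d = len(x[0])
    out = []
    for i in range(len(x)):
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(
            [sum(weights[j] / z * x[j][k] for j in range(i + 1)) for k in range(d)]
        )
    return out


def embed(tokens, latent_queries):
    """Append the shared queries after the input tokens and pool their outputs.

    Assumption: mean-pool the query positions, then L2-normalize.
    """
    hidden = causal_attention(tokens + latent_queries)
    query_out = hidden[len(tokens):]
    d = len(query_out[0])
    pooled = [sum(row[k] for row in query_out) / len(query_out) for k in range(d)]
    norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
    return [v / norm for v in pooled]


rng = random.Random(0)


def rand_tokens(n, d):
    """Random vectors standing in for token embeddings (illustration only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


dim = 8
text_tokens = rand_tokens(5, dim)    # stand-in for text token embeddings
image_tokens = rand_tokens(12, dim)  # stand-in for image patch embeddings
queries = rand_tokens(4, dim)        # the only trainable parameters in this sketch

text_emb = embed(text_tokens, queries)
image_emb = embed(image_tokens, queries)

# Both embeddings are unit-norm, so the dot product is a cosine similarity.
similarity = sum(a * b for a, b in zip(text_emb, image_emb))
print(f"cosine similarity: {similarity:.4f}")
```

Because the queries sit at the end of the sequence, causal masking means the input tokens never attend to them, so in this sketch the hidden states at the token positions are identical with or without the queries. That is one way to read the abstract's claim that the adaptation is non-invasive: only the query vectors would be trained, while the backbone's representation of the input stays unchanged.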
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 92 | 491 |
| Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 | 81.8 | 432 |
| Composed Image Retrieval | Fashion-IQ | Average Recall@50 | 43.1 | 129 |
| Composed Image Retrieval (Image-Text to Image) | CIRR | -- | -- | 128 |
| Image-to-Text Retrieval | COCO 5K (test) | R@1 | 69.6 | 57 |
| Text-to-Image Retrieval | COCO 5K (test) | R@1 | 55.4 | 53 |
| Image Retrieval | Flickr30k (1K) | R@1 | 80.9 | 21 |
| Multimodal Retrieval | MMEB v1 (test) | Classification | 60.9 | 18 |
| Image Retrieval | COCO I2I | R@1 | 55.4 | 7 |
| Text-rendered-as-image Retrieval | Flickr30K I2I | R@1 | 90.2 | 7 |