SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

About

Multimodal Large Language Models (MLLMs) exhibit strong reasoning and world knowledge, yet adapting them for retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. In this work, we argue that adapting MLLMs for retrieval should focus on eliciting pre-trained representations rather than overwriting them. To this end, we propose SLQ, an effective and efficient framework that adapts a frozen MLLM into a retriever through a small set of Shared Latent Queries. Appended to the end of both text and image token sequences, these queries leverage the model's native causal attention to serve as global aggregation interfaces, producing compact embeddings in a unified space while keeping the backbone unchanged. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench. The results demonstrate that SLQ, which preserves pre-trained representations, provides an effective and efficient framework for adapting MLLMs to retrieval.
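The query mechanism described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it stands in for the frozen MLLM with a single causal self-attention step using identity projections, and the names `causal_self_attention`, `embed`, `DIM`, and `NUM_QUERIES`, as well as the mean pooling over query outputs, are assumptions made for illustration. It shows only the structural idea: the same query vectors are appended after the text or image tokens, causal masking lets them attend to every preceding token, and the token positions themselves are unaffected.

```python
import math
import random


def causal_self_attention(seq):
    """Toy single-head causal attention with identity projections.

    Stands in for a frozen backbone (hypothetical simplification):
    position i attends only to positions 0..i.
    """
    dim = len(seq[0])
    out = []
    for i, q in enumerate(seq):
        scores = [
            sum(a * b for a, b in zip(q, seq[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum((w / total) * seq[j][d] for j, w in enumerate(weights))
            for d in range(dim)
        ])
    return out


def embed(tokens, latent_queries):
    """Append the shared queries after the tokens and pool their outputs.

    Mean pooling followed by L2 normalization is an assumption; the abstract
    does not say how the query outputs are combined.
    """
    hidden = causal_self_attention(tokens + latent_queries)
    query_states = hidden[len(tokens):]
    dim = len(tokens[0])
    pooled = [
        sum(h[d] for h in query_states) / len(query_states) for d in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4  # hypothetical sizes


def random_tokens(n):
    return [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]


# One set of queries is shared by both modalities; in a real setup these
# would be the only trainable parameters (training is omitted here).
latent_queries = random_tokens(NUM_QUERIES)
text_tokens = random_tokens(5)    # stand-in for text token states
image_tokens = random_tokens(12)  # stand-in for image token states

text_emb = embed(text_tokens, latent_queries)
image_emb = embed(image_tokens, latent_queries)
similarity = sum(a * b for a, b in zip(text_emb, image_emb))
```

Because attention is causal, the outputs at the original token positions are identical with or without the appended queries, which is one way to read "keeping the backbone unchanged": the queries read from the sequence without altering it.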

Haoran Lou, Ziyan Liu, Chunxiao Fan, Yuexin Wu, Yue Ming • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1: 92 | 491 |
| Text-to-Image Retrieval | Flickr30K 1K (test) | R@1: 81.8 | 432 |
| Composed Image Retrieval (Image-Text to Image) | CIRR | Recall@5: 63.5 | 93 |
| Composed Image Retrieval | Fashion-IQ | Average Recall@50: 43.1 | 80 |
| Image-to-Text Retrieval | COCO 5K (test) | R@1: 69.6 | 47 |
| Text-to-Image Retrieval | COCO 5K (test) | R@1: 55.4 | 43 |
| Image Retrieval | Flickr30k (1K) | R@1: 80.9 | 21 |
| Multimodal Retrieval | MMEB v1 (test) | Classification: 60.9 | 18 |
| Image Retrieval | COCO I2I | R@1: 55.4 | 7 |
| Text-rendered-as-image Retrieval | Flickr30K I2I | R@1: 90.2 | 7 |
Showing 10 of 13 rows
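Most results above are reported as R@K (Recall@K). The page does not define the metric, so the sketch below assumes the usual retrieval definition: the fraction of queries for which at least one relevant candidate appears among the K highest-scoring candidates. The function name `recall_at_k` and the toy similarity matrix are hypothetical and are not data from the paper.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries with a relevant candidate in the top-k results.

    similarity[q][c] is the score of candidate c for query q.
    ground_truth[q] is the set of relevant candidate indices for query q.
    """
    hits = 0
    for scores, relevant in zip(similarity, ground_truth):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if relevant & set(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy example: 3 queries, 3 candidates (made-up scores).
similarity = [
    [0.9, 0.1, 0.3],  # query 0: candidate 0 ranks first
    [0.2, 0.4, 0.8],  # query 1: candidate 2 ranks first, candidate 1 second
    [0.5, 0.6, 0.1],  # query 2: candidate 1 ranks first
]
ground_truth = [{0}, {1}, {1}]

r_at_1 = recall_at_k(similarity, ground_truth, 1)  # 2 of 3 queries hit
r_at_2 = recall_at_k(similarity, ground_truth, 2)  # all 3 queries hit
```

The table values appear to be this fraction expressed as a percentage.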
