SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

About

Multimodal Large Language Models (MLLMs) possess intrinsic reasoning and world-knowledge capabilities, yet adapting them for dense retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. To address this, we propose SLQ, a parameter-efficient tuning framework that adapts MLLMs for retrieval while keeping the backbone entirely frozen. SLQ introduces a small set of Shared Latent Queries that are appended to both text and image tokens, leveraging the model's native causal attention to aggregate multimodal context into a unified embedding space. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench, validating that preserving the pre-trained representations via non-invasive adaptation is an effective strategy for MLLM-based retrieval. The code is available at https://github.com/CnFaker/SLQ.
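The idea of appending shared queries to a frozen causal model can be illustrated with a toy sketch. The code below is not the authors' implementation: the single identity-projection attention layer stands in for a frozen MLLM, and the names `causal_self_attention`, `embed`, `shared_queries`, the mean pooling over query positions, and all sizes are assumptions made for illustration. It only shows that queries placed after the input can read the whole input through causal attention, while the input positions themselves are left unchanged.

```python
import math
import random


def causal_self_attention(tokens):
    """Toy single-head causal attention with identity projections.

    Hypothetical stand-in for one frozen backbone layer: it has no
    trainable parameters of its own.
    """
    dim = len(tokens[0])
    outputs = []
    for i, query in enumerate(tokens):
        visible = tokens[: i + 1]  # causal mask: position i sees 0..i only
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in visible]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([sum(w / total * v[d] for w, v in zip(weights, visible))
                        for d in range(dim)])
    return outputs


def embed(input_tokens, latent_queries):
    """Append the shared queries, run the frozen layer, pool the query outputs."""
    hidden = causal_self_attention(input_tokens + latent_queries)
    query_out = hidden[len(input_tokens):]
    dim = len(query_out[0])
    # Mean pooling and L2 normalization are assumptions of this sketch.
    pooled = [sum(h[d] for h in query_out) / len(query_out) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4  # illustrative sizes, not taken from the paper

# The only parameters that would be trained; shared by both modalities.
shared_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
text_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
image_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(9)]

text_emb = embed(text_tokens, shared_queries)
image_emb = embed(image_tokens, shared_queries)
similarity = sum(a * b for a, b in zip(text_emb, image_emb))  # cosine similarity
```

Because the queries come last in the sequence, the causal mask prevents them from altering the hidden states of the original text or image tokens, which is one way to read the "non-invasive" claim in the abstract.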

Haoran Lou, Ziyan Liu, Chunxiao Fan, Yuexin Wu, Yue Ming, Hao Wu, Kai Zuo, Yibo Chen, Xu Tang • 2026

Related benchmarks

Task	Dataset	Metric	Result
Image-to-Text Retrieval	Flickr30K 1K (test)	R@1	92	491
Text-to-Image Retrieval	Flickr30K 1K (test)	R@1	81.8	432
Composed Image Retrieval	Fashion-IQ	Average Recall@50	43.1	129
Composed Image Retrieval (Image-Text to Image)	CIRR	--	--	128
Image-to-Text Retrieval	COCO 5K (test)	R@1	69.6	57
Text-to-Image Retrieval	COCO 5K (test)	R@1	55.4	53
Image Retrieval	Flickr30k (1K)	R@1	80.9	21
Multimodal Retrieval	MMEB v1 (test)	Classification	60.9	18
Image Retrieval	COCO I2I	R@1	55.4	7
Text-rendered-as-image Retrieval	Flickr30K I2I	R@1	90.2	7
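
For readers unfamiliar with the R@1 figures in the table, the following is a minimal sketch of how Recall@K is commonly computed from a query-by-candidate similarity matrix. The function name `recall_at_k` and the toy scores are hypothetical and are not taken from the SLQ codebase or from any of the listed benchmarks.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Percentage of queries with at least one relevant candidate in the top k.

    similarity[i][j] is the score of candidate j for query i;
    ground_truth[i] is the set of relevant candidate indices for query i.
    """
    hits = 0
    for scores, relevant in zip(similarity, ground_truth):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if relevant & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 3 queries, 3 candidates (made-up scores).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
gt = [{0}, {1}, {1}]

r1 = recall_at_k(sim, gt, k=1)  # queries 0 and 2 hit, query 1 misses
r2 = recall_at_k(sim, gt, k=2)  # all three queries hit
```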
