
FreeRet: MLLMs as Training-Free Retrievers

About

Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet they typically require heavy post-hoc training to be converted into contrastive encoders. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation on explicit priors, and mitigating the framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.
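To make the two-stage design above concrete, here is a minimal, self-contained sketch of an "embed, then rerank" pipeline in the spirit of FreeRet. This is our own illustration, not the authors' code: the MLLM is replaced by a deterministic toy encoder so the snippet runs on its own, and helper names such as `embed_with_prior` and `rerank_neutral` are hypothetical. Comments mark where the real system would take the model's hidden state before the lexical head, prepend an explicit prior to the input, and rerank with a neutrally framed choice prompt.

```python
import hashlib
import numpy as np


def toy_hidden_state(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in for the MLLM's last-layer hidden state, taken *before* the
    lexical (LM-head) projection, as in FreeRet's first advance. Here a
    deterministic random vector keeps the sketch self-contained."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "little")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)


def embed_with_prior(content: str, prior: str) -> np.ndarray:
    """Stage 1: prepend an explicit prior (task instruction) so the model
    knows what aspect of the input to represent, then embed."""
    return toy_hidden_state(f"{prior}\n{content}")


def rerank_neutral(query: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Stage 2: in the real pipeline the MLLM would judge each (query,
    candidate) pair with a neutrally framed choice prompt (e.g. pick
    'Option A' vs. 'Option B' rather than answer yes/no) to mitigate the
    framing effect; here a toy cosine similarity stands in for that score."""
    q = toy_hidden_state(query)
    scored = [(c, float(q @ toy_hidden_state(c))) for c in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    prior = "Represent this text for cross-modal retrieval:"
    corpus = [
        "a dog catching a frisbee in a park",
        "a step-by-step pancake recipe",
        "a cat napping on a sofa",
    ]
    query = "a pet playing outdoors"

    # Stage 1: fast candidate search with prior-conditioned embeddings.
    corpus_emb = np.stack([embed_with_prior(d, prior) for d in corpus])
    q_emb = embed_with_prior(query, prior)
    shortlist_idx = np.argsort(corpus_emb @ q_emb)[::-1][:2]

    # Stage 2: precise reranking of the shortlisted candidates.
    shortlist = [corpus[i] for i in shortlist_idx]
    print(rerank_neutral(query, shortlist))
```

In the actual framework a single MLLM would play both roles, which is what lets retrieval, reranking, and generation share one model end to end.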

Yuhan Zhu, Xiangyu Zeng, Chenting Wang, Xinhao Li, Yicheng Xu, Ziang Yan, Yi Wang, Limin Wang • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Video Classification | MMEB-V2 Video Classification (Kinetics-700, SSv2, HMDB, UCF, Breakfast), test | Classification Accuracy | 63.2 | 8
Video Retrieval | MMEB-V2 Video Retrieval (MSRVTT, MSVD, DiDeMo, YouCook2, VATEX), test | Retrieval Score | 39.3 | 8
