Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model

About

Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model scores an NDCG@5 of 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach builds on the NVIDIA Eagle2 vision-language model (VLM), replaces its causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency, which we analyze in detail. Additionally, we adopt a two-stage training strategy to enhance the model's retrieval capabilities.
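To make the ColBERT-style late interaction mentioned above concrete, here is a minimal sketch of MaxSim scoring in plain Python. It is not the released model's code: the function names (`maxsim_score`, `rank_documents`), the two-dimensional toy vectors, and the page identifiers are all hypothetical, and real token embeddings would come from the model's query and page encoders.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late-interaction score: for each query token embedding, take its best
    match among the document token embeddings, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank_documents(query_tokens: List[Vector],
                   corpus: Dict[str, List[Vector]]) -> List[str]:
    """Return document ids sorted by descending MaxSim score."""
    scores = {doc_id: maxsim_score(query_tokens, toks)
              for doc_id, toks in corpus.items()}
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Toy data (hypothetical): two query token vectors and two "pages",
# each page stored as one vector per token or image patch.
query = [[1.0, 0.0], [0.0, 1.0]]
corpus = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "page_b": [[0.5, 0.5], [0.6, 0.0]],
}

ranking = rank_documents(query, corpus)
```

Because every document keeps one vector per token or patch rather than a single pooled vector, index size grows with document length, which is where the storage and efficiency trade-offs noted in the abstract come from.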

Mengyao Xu, Gabriel Moreira, Ronay Ak, Radek Osmulski, Yauhen Babakhin, Zhiding Yu, Benedikt Schifferer, Even Oldridge • 2025

Related benchmarks

Task	Dataset	Result
Visual document retrieval	ViDoRe V2	Avg nDCG@5 63.5	39
Visual document retrieval	ViDoRe V3	HR 58.69	23
Visual document retrieval	JinaVDR	nDCG@10 67.8	15
Visual document retrieval	VisRAG	--	13
Visual document retrieval	ViDoRe V1	nDCG@10 91	11
Visual document retrieval	Vidore 2	nDCG@10 62.1	11
Visual document retrieval	VisDocOOD	nDCG@10 69.7	11
Visual document retrieval	Vidore V1 & V2	Avg. Acc 83.1	10
Visual document retrieval	MIRACL Vision	Arabic 0.4247	8
Document Retrieval	ViDoRe V2	Economics nDCG@5 57.8	7
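Most results above are reported as nDCG@k. As a reminder of what that metric measures, the following is a minimal sketch of a common linear-gain formulation. The function names are hypothetical, the ideal ranking is assumed to be computable from the judged relevances passed in, and the exact variant used by each leaderboard may differ.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: List[float], k: int) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ordering.
    Assumes ranked_relevances covers every judged document for the query."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in the order the retriever returned the pages (toy data).
perfect = ndcg_at_k([1.0, 0.0, 0.0], k=5)
second_place = ndcg_at_k([0.0, 1.0], k=5)
```

A relevant page returned in first position gives a score of 1.0, while the same page returned in second position is discounted by a factor of 1 / log2(3).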

