Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model

About

Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model scores an NDCG@5 of 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach builds on the NVIDIA Eagle2 vision-language model (VLM), modifies its architecture by replacing causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency, which we analyze in detail. Additionally, we adopt a two-stage training strategy to enhance the model's retrieval capabilities.
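
The ColBERT-style late interaction mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual MaxSim formulation: each query token embedding is matched to its most similar document token embedding, and the per-token maxima are summed. The function names, the toy 2-dimensional vectors, and the storage helper are hypothetical and are not taken from the model's released code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: sum over query tokens of the best match among document tokens."""
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


def rank_documents(query_tokens, docs):
    """Return (doc_id, score) pairs sorted from best to worst match."""
    scores = {doc_id: maxsim_score(query_tokens, toks) for doc_id, toks in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def embedding_storage_bytes(num_vectors, dim, bytes_per_value=2):
    """Rough index size for one document: one vector is stored per token or image patch."""
    return num_vectors * dim * bytes_per_value


# Toy data (assumed values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "page_b": [[1.0, 0.0], [-1.0, 0.0]],
}
ranking = rank_documents(query, docs)
print(ranking)
```

Because every document keeps one vector per token or patch rather than a single pooled vector, index size grows linearly with the number of vectors per document. This is the kind of storage cost the abstract refers to.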

Mengyao Xu, Gabriel Moreira, Ronay Ak, Radek Osmulski, Yauhen Babakhin, Zhiding Yu, Benedikt Schifferer, Even Oldridge • 2025

Related benchmarks

Task                       Dataset         Metric    Result  Rank
Visual document retrieval  ViDoRe V3       HR        58.69   23
Visual document retrieval  JinaVDR         nDCG@10   67.8    15
Visual document retrieval  ViDoRe V1       nDCG@10   91      11
Visual document retrieval  Vidore 2        nDCG@10   62.1    11
Visual document retrieval  VisDocOOD       nDCG@10   69.7    11
Visual document retrieval  VisRAG          nDCG@10   85.5    11
Visual document retrieval  Vidore V1 & V2  Avg. Acc  83.1    10
Visual document retrieval  MIRACL Vision   Arabic    0.4247  8
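
For readers unfamiliar with the nDCG@k metric used in most of these results, the following is a minimal sketch of how it is commonly computed. It assumes linear gain (the relevance label itself) and a log2 rank discount; some evaluation toolkits use an exponential gain instead, and the two agree for binary labels. The function names are hypothetical and this is not the benchmarks' official evaluation code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k):
    """nDCG@k, given the relevance labels of all candidates in the order the system ranked them."""
    ideal = sorted(ranked_relevances, reverse=True)  # best possible ordering
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# One relevant page, retrieved at rank 2 out of 3 candidates.
print(ndcg_at_k([0, 1, 0], k=5))
```

A benchmark score is typically the mean of this value over all queries, often reported on a 0 to 100 scale.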
