Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model
About
Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model scores an NDCG@5 of 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach builds on the NVIDIA Eagle2 vision-language model (VLM), replaces its causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency, which we analyze in detail. Additionally, we adopt a two-stage training strategy to enhance the model's retrieval capabilities.
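To make the ColBERT-style late interaction concrete, the following is a minimal sketch of MaxSim scoring in pure Python. It is not the model's actual implementation: the function names (`l2_normalize`, `maxsim_score`), the toy 2-dimensional embeddings, and the choice to L2-normalize each token vector are assumptions made for illustration only.

```python
from math import sqrt


def l2_normalize(vec):
    """Scale a vector to unit length (assumed here; the model may differ)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction relevance score.

    For every query token embedding, take the maximum dot product over all
    document token (or image patch) embeddings, then sum those maxima.
    """
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


# Toy example: two query tokens, two candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has an exact match per query token
doc_b = [[1.0, 1.0]]                          # only a partial match

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Because the score needs one embedding per token or patch rather than a single vector per document, the index grows with document length, which is the source of the storage and efficiency trade-offs mentioned above.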
Related benchmarks
| Task | Dataset | Metric / split | Result | Rank |
|---|---|---|---|---|
| Visual document retrieval | ViDoRe V3 | HR | 58.69 | 23 |
| Visual document retrieval | JinaVDR | nDCG@10 | 67.8 | 15 |
| Visual document retrieval | ViDoRe V1 | nDCG@10 | 91 | 11 |
| Visual document retrieval | ViDoRe V2 | nDCG@10 | 62.1 | 11 |
| Visual document retrieval | VisDocOOD | nDCG@10 | 69.7 | 11 |
| Visual document retrieval | VisRAG | nDCG@10 | 85.5 | 11 |
| Visual document retrieval | ViDoRe V1 & V2 | Avg. Acc | 83.1 | 10 |
| Visual document retrieval | MIRACL Vision | Arabic | 0.4247 | 8 |
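
Most results above are reported as nDCG@k. As a reminder of what that metric measures, here is a minimal sketch assuming linear gain and a log2 rank discount; individual benchmarks may use a different gain function or tooling, and the names `dcg_at_k` and `ndcg_at_k` are illustrative only. The input is assumed to be the relevance labels of all candidate documents, listed in the order the retriever ranked them.

```python
from math import log2


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked items (linear gain)."""
    return sum(rel / log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (best possible) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# One relevant page among three candidates.
perfect = ndcg_at_k([1, 0, 0], k=5)   # relevant page ranked first
second = ndcg_at_k([0, 1, 0], k=5)    # relevant page ranked second
```

A leaderboard value is then the mean of this per-query score over all queries in the dataset, usually reported on a 0 to 100 scale.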