Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video

About

We present Omni-Embed-Nemotron, a unified multimodal retrieval embedding model developed to handle the increasing complexity of real-world information needs. While Retrieval-Augmented Generation (RAG) has significantly advanced language models by incorporating external knowledge, existing text-based retrievers rely on clean, structured input and struggle with the visually and semantically rich content found in real-world documents such as PDFs, slides, or videos. Recent work such as ColPali has shown that preserving document layout using image-based representations can improve retrieval quality. Building on this, and inspired by the capabilities of recent multimodal models such as Qwen2.5-Omni, we extend retrieval beyond text and images to also support audio and video modalities. Omni-Embed-Nemotron enables both cross-modal (e.g., text - video) and joint-modal (e.g., text - video+audio) retrieval using a single model. We describe the architecture, training setup, and evaluation results of Omni-Embed-Nemotron, and demonstrate its effectiveness in text, image, and video retrieval.

Mengyao Xu, Wenfei Zhou, Yauhen Babakhin, Gabriel Moreira, Ronay Ak, Radek Osmulski, Bo Liu, Even Oldridge, Benedikt Schifferer• 2025

Related benchmarks

Task	Dataset	Result
Text-to-Audio Retrieval	AudioCaps (test)	Recall@120.5	180
Audio-to-Text Retrieval	Clotho	R@13.5	49
Text-to-Video Retrieval	DiDeMo (DDM) zero-shot	R@141.9	36
Text-to-Audio Retrieval	Clotho	R@10.064	31
Text-to-Video Retrieval	MSVD zero-shot	Recall@155.8	26
Retrieval	ATIR A→T (test)	R@175.47	18
Retrieval	MMEB v2	Image Retrieval Score43.7	18
Text-to-Text Retrieval	Clotho	R@157.84	13
Text-to-Audio Retrieval	Clotho (evaluation)	R@17.2	13
Text-to-Audio Retrieval	MECAT (test)	Recall@14.09	13

Showing 10 of 25 rows

Other info

Follow for update

@wizwand_team Discord