VeRVE: Versatile Retrieval for Videos via Unified Embeddings

About

Modern video retrieval systems are expected to handle diverse tasks, ranging from corpus-level retrieval and fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they cannot process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search, but their retrieval performance remains well below that of specialized systems. We present VeRVE, an MLLM-based versatile video retrieval framework that integrates corpus-level and moment-level retrieval capabilities while accommodating composed multimodal queries within a single architecture. We contrastively align visual and textual embeddings generated by a shared MLLM backbone to enable efficient embedding-based candidate search. Our embedding model, trained efficiently using low-rank adaptation (LoRA) on 700K paired visual-text samples, surpasses other MLLM-based methods on zero-shot video retrieval tasks. Additionally, we demonstrate that the same model can be adapted without further training to achieve competitive results on zero-shot moment retrieval and state-of-the-art results on zero-shot composed video retrieval. With additional training for reranking the candidates identified by the embedding-based search, our model substantially outperforms existing MLLM-based retrieval systems and achieves retrieval performance comparable to state-of-the-art specialized models.
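
The two ideas the abstract names, contrastive alignment of paired embeddings and embedding-based candidate search, can be illustrated with a minimal sketch. The abstract does not say which contrastive loss is used; the symmetric InfoNCE loss below is a common choice and is an assumption here, as are the temperature value and the function names `info_nce` and `cosine_top_k`. The embeddings are plain lists of floats standing in for the output of the MLLM backbone.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Assumption: row i of video_embs is paired with row i of text_embs.
    The temperature of 0.07 is illustrative, not taken from the source.
    """
    v = [l2_normalize(x) for x in video_embs]
    t = [l2_normalize(x) for x in text_embs]
    logits = [[dot(vi, tj) / temperature for tj in t] for vi in v]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], with the diagonal as the positive pair.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum_exp - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))


def cosine_top_k(query_emb, corpus_embs, k):
    """Return indices of the k corpus embeddings most similar to the query."""
    q = l2_normalize(query_emb)
    scored = [(dot(q, l2_normalize(e)), i) for i, e in enumerate(corpus_embs)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return [i for _, i in scored[:k]]
```

In a two-stage pipeline of the kind the abstract describes, the indices returned by a search such as `cosine_top_k` would form the candidate set that a separately trained reranker then reorders; the reranker itself is not sketched here because the abstract gives no detail on it.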

Shaunak Halbe, Bhagyashree Puranik, Jayakrishnan Unnikrishnan, Kushan Thakkar, Vimal Bhat, Toufiq Parag • 2026

Related benchmarks

Task	Dataset	Result
Text-to-Video Retrieval	DiDeMo (test)	R@1 58.8	407
Text-to-Video Retrieval	MSVD (test)	R@1 57.8	211
Video-to-Text Retrieval	DiDeMo (test)	R@1 52.5	111
Video-to-Text Retrieval	MSVD (test)	R@1 79	68
Text-to-Video Retrieval	MSR-VTT 1K (test)	R@1 55.3	65
Natural Language Video Localization	Charades-STA (test)	R@1 (IoU=0.5) 36.8	61
Video-to-Text Retrieval	MSR-VTT 1K (test)	R@1 47	39
Natural Language Video Localization	ActivityNet Caption (test)	IoU @ 0.5 26.7	16
Composed Video Retrieval	CoVR (test)	Recall@1 68.3	6
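
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how Recall@K and temporal IoU are commonly defined. The exact evaluation protocol for each benchmark is not stated on this page, so the functions `recall_at_k` and `temporal_iou` and their input formats are illustrative assumptions rather than the authors' evaluation code.

```python
def recall_at_k(ranked_lists, ground_truth, k=1):
    """Percentage of queries whose correct item appears in the top k results.

    ranked_lists[i] is the ranked list of retrieved item ids for query i and
    ground_truth[i] is the single correct item id (an assumed input format).
    """
    hits = sum(1 for ranked, gt in zip(ranked_lists, ground_truth) if gt in ranked[:k])
    return 100.0 * hits / len(ground_truth)


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time spans in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1_iou(top1_spans, gt_spans, threshold=0.5):
    """Percentage of queries whose top-ranked span reaches the IoU threshold."""
    hits = sum(1 for p, g in zip(top1_spans, gt_spans) if temporal_iou(p, g) >= threshold)
    return 100.0 * hits / len(gt_spans)
```

Under these definitions, a moment-retrieval score such as R@1 (IoU=0.5) counts a query as correct only when the single top-ranked span overlaps the annotated span with an IoU of at least 0.5.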

