VeRVE: Versatile Retrieval for Videos via Unified Embeddings

About

Modern video retrieval systems are expected to handle diverse tasks, ranging from corpus-level retrieval and fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they cannot process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search, but their retrieval performance remains well below that of specialized systems. We present VeRVE, an MLLM-based versatile video retrieval framework that integrates corpus-level and moment-level retrieval capabilities while accommodating composed multimodal queries within a single architecture. We contrastively align visual and textual embeddings generated by a shared MLLM backbone to enable efficient embedding-based candidate search. Our embedding model, trained efficiently using low-rank adaptation (LoRA) on 700K paired visual-text samples, surpasses other MLLM-based methods on zero-shot video retrieval tasks. Additionally, we demonstrate that the same model can be adapted without further training to achieve competitive results on zero-shot moment retrieval and state-of-the-art results on zero-shot composed video retrieval. With additional training for reranking the candidates identified by the embedding-based search, our model substantially outperforms existing MLLM-based retrieval systems and achieves retrieval performance comparable to state-of-the-art specialized models.
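To make the embedding-based candidate search concrete, here is a minimal sketch of ranking videos by cosine similarity to a text query and scoring the ranking with Recall@K. It is not the authors' implementation: the function names and the toy two-dimensional embeddings are hypothetical, and a real system would take its embeddings from the trained MLLM backbone.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_scores(query, candidates):
    """Cosine similarity between one query embedding and each candidate."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(c))) for c in candidates]

def top_k(query, candidates, k):
    """Indices of the k candidates most similar to the query, best first."""
    scores = cosine_scores(query, candidates)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def recall_at_k(queries, candidates, ground_truth, k):
    """Fraction of queries whose correct candidate appears in the top k."""
    hits = sum(
        1 for q, gt in zip(queries, ground_truth) if gt in top_k(q, candidates, k)
    )
    return hits / len(queries)

# Toy data (assumed for illustration only): three videos, three text queries,
# where query i is paired with video i.
video_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_embeddings = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.8]]
pairs = [0, 1, 2]

print(top_k(text_embeddings[0], video_embeddings, 2))
print(recall_at_k(text_embeddings, video_embeddings, pairs, k=1))
```

In a pipeline like the one described above, the top-k list would serve as the candidate set handed to a reranking stage; the reranker itself is outside the scope of this sketch.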

Shaunak Halbe, Bhagyashree Puranik, Jayakrishnan Unnikrishnan, Kushan Thakkar, Vimal Bhat, Toufiq Parag • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Video Retrieval | DiDeMo (test) | R@1 | 58.8 | 407 |
| Text-to-Video Retrieval | MSVD (test) | R@1 | 57.8 | 211 |
| Video-to-Text Retrieval | DiDeMo (test) | R@1 | 52.5 | 111 |
| Video-to-Text Retrieval | MSVD (test) | R@1 | 79 | 68 |
| Text-to-Video Retrieval | MSR-VTT 1K (test) | R@1 | 55.3 | 65 |
| Natural Language Video Localization | Charades-STA (test) | R@1 (IoU=0.5) | 36.8 | 61 |
| Video-to-Text Retrieval | MSR-VTT 1K (test) | R@1 | 47 | 39 |
| Natural Language Video Localization | ActivityNet Caption (test) | IoU @ 0.5 | 26.7 | 16 |
| Composed Video Retrieval | CoVR (test) | Recall@1 | 68.3 | 6 |
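The localization rows report results at an IoU threshold of 0.5. As a rough illustration of how such a metric is commonly computed, the sketch below measures temporal IoU between a predicted and a ground-truth segment and counts a top-1 prediction as a hit when the IoU reaches the threshold. The function names are hypothetical, treating a tie at the threshold as a hit is an assumption, and this is not the evaluation code behind the numbers above.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1_iou(top1_preds, gts, threshold=0.5):
    """Fraction of queries whose top-1 segment has IoU >= threshold.

    Assumption: an IoU exactly equal to the threshold counts as a hit.
    """
    hits = sum(
        1 for p, g in zip(top1_preds, gts) if temporal_iou(p, g) >= threshold
    )
    return hits / len(gts)

# Toy segments (assumed for illustration only).
predictions = [(0.0, 10.0), (2.0, 10.0)]
ground_truth = [(5.0, 15.0), (0.0, 10.0)]

print(temporal_iou(predictions[0], ground_truth[0]))
print(recall_at_1_iou(predictions, ground_truth))
```

The first toy prediction overlaps its target for 5 of 15 seconds of union and misses the 0.5 threshold, while the second overlaps for 8 of 10 seconds and counts as a hit.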
