VIRTUE: Versatile Video Retrieval Through Unified Embeddings

About

Modern video retrieval systems are expected to handle diverse tasks ranging from corpus-level retrieval and fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they cannot process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search, but their retrieval performance remains well below that of specialized systems. We present VIRTUE, an MLLM-based versatile video retrieval framework that integrates corpus-level and moment-level retrieval capabilities while accommodating composed multimodal queries within a single architecture. We contrastively align visual and textual embeddings generated by a shared MLLM backbone to enable efficient embedding-based candidate search. Our embedding model, trained efficiently using low-rank adaptation (LoRA) on 700K paired visual-text samples, surpasses other MLLM-based methods on zero-shot video retrieval tasks. Additionally, we demonstrate that the same model can be adapted without further training to achieve competitive results on zero-shot moment retrieval and state-of-the-art results on zero-shot composed video retrieval. With additional training for reranking the candidates identified by the embedding-based search, our model substantially outperforms existing MLLM-based retrieval systems and achieves retrieval performance comparable to state-of-the-art specialized models that are trained on orders of magnitude more data.
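
The abstract describes contrastive alignment of text and video embeddings, followed by embedding-based candidate search and an optional reranking stage. The sketch below illustrates that general pattern on toy vectors. It is not the authors' implementation: the symmetric InfoNCE form, the temperature value of 0.07, the function names, and the `rerank_score` callback are all assumptions made for illustration, and the real system would obtain embeddings from the MLLM backbone rather than from hand-written lists.

```python
import math


def l2_normalize(v):
    # Assumes a non-zero vector; a zero vector would divide by zero.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_sim(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def info_nce(text_embs, video_embs, temperature=0.07):
    """Symmetric InfoNCE loss; text_embs[i] is assumed to pair with video_embs[i]."""
    n = len(text_embs)
    logits = [[cosine_sim(t, v) / temperature for v in video_embs]
              for t in text_embs]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    text_to_video = cross_entropy_diag(logits)
    video_to_text = cross_entropy_diag([list(col) for col in zip(*logits)])
    return 0.5 * (text_to_video + video_to_text)


def top_k(query_emb, corpus_embs, k):
    """Return indices of the k corpus items most similar to the query."""
    scored = [(cosine_sim(query_emb, v), idx)
              for idx, v in enumerate(corpus_embs)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored[:k]]


def retrieve_then_rerank(query_emb, corpus_embs, rerank_score, k=2):
    """Stage 1: embedding search. Stage 2: reorder candidates by a
    hypothetical, more expensive scoring function rerank_score(idx)."""
    candidates = top_k(query_emb, corpus_embs, k)
    return sorted(candidates, key=lambda idx: -rerank_score(idx))


# Toy data: two text queries and their matching "videos".
text = [[1.0, 0.0], [0.0, 1.0]]
video = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = info_nce(text, video)
shuffled_loss = info_nce(text, video[::-1])
```

On this toy batch the loss is lower when each text embedding sits next to its paired video embedding than when the pairs are swapped, which is the property contrastive training optimizes for. The two-stage function shows why reranking is cheap: only the `k` candidates surviving the embedding search are rescored.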

Shaunak Halbe, Bhagyashree Puranik, Jayakrishnan Unnikrishnan, Kushan Thakkar, Vimal Bhat, Toufiq Parag • 2026

Related benchmarks

Task                                 Dataset                    Metric          Result  Rank
Text-to-Video Retrieval              DiDeMo (test)              R@1             58.8    376
Text-to-Video Retrieval              MSVD (test)                R@1             57.8    204
Video-to-Text Retrieval              DiDeMo (test)              R@1             52.5    92
Video-to-Text Retrieval              MSVD (test)                R@1             79      61
Natural Language Video Localization  Charades-STA (test)        R@1 (IoU=0.5)   36.8    61
Text-to-Video Retrieval              MSR-VTT 1K (test)          R@1             55.3    45
Video-to-Text Retrieval              MSR-VTT 1K (test)          R@1             47      39
Natural Language Video Localization  ActivityNet Caption (test) IoU @ 0.5       26.7    16
Composed Video Retrieval             CoVR (test)                Recall@1        68.3    6
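
The metrics in the table above are Recall@1 for retrieval and Recall@1 at a temporal IoU threshold of 0.5 for moment localization. A minimal sketch of how such metrics are commonly computed follows. The function names, the toy inputs, and the choice to count a prediction as correct when IoU is greater than or equal to the threshold are assumptions; the exact evaluation protocol behind the reported numbers is not given on this page.

```python
def recall_at_k(ranked_lists, ground_truth, k=1):
    """Percentage of queries whose correct item appears in the top k results.

    ranked_lists[i] is the ranked list of item ids returned for query i,
    and ground_truth[i] is the single correct item id for that query.
    """
    hits = sum(1 for ranked, gt in zip(ranked_lists, ground_truth)
               if gt in ranked[:k])
    return 100.0 * hits / len(ground_truth)


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time spans in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def moment_recall_at_1(top1_preds, gt_moments, iou_threshold=0.5):
    """Percentage of queries whose top-1 predicted moment overlaps the
    ground-truth moment with IoU at or above the threshold (assumed >=)."""
    hits = sum(1 for p, g in zip(top1_preds, gt_moments)
               if temporal_iou(p, g) >= iou_threshold)
    return 100.0 * hits / len(gt_moments)


# Toy example: four retrieval queries and two localization queries.
r1 = recall_at_k([[0], [1], [2], [3]], [0, 0, 2, 3], k=1)
m1 = moment_recall_at_1([(2.0, 8.0), (0.0, 10.0)],
                        [(0.0, 10.0), (5.0, 15.0)])
```

In the toy example, three of four retrieval queries place the correct item first, and one of the two predicted moments clears the 0.5 IoU threshold.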