Adapting MLLMs for Nuanced Video Retrieval

About

Our objective is to build an embedding model that captures the nuanced relationship between a search query and candidate videos. We cover three aspects of nuanced retrieval: (i) temporal, (ii) negation, and (iii) multimodal. For temporal nuance, we consider chiral actions, which require distinguishing between temporally opposite actions such as "opening a door" vs. "closing a door". For negation, we consider queries with negators such as "not" and "none" that allow the user to specify what they do not want. For multimodal nuance, we consider the task of composed retrieval, where the query comprises a video along with a text edit instruction. The goal is to develop a unified embedding model that handles such nuances effectively. To that end, we repurpose a Multimodal Large Language Model (MLLM) trained to generate text into an embedding model. We fine-tune it with a contrastive loss on text alone, with carefully sampled hard negatives that instill the desired nuances in the learned embedding space. Despite the text-only training, our method achieves state-of-the-art performance on all benchmarks for nuanced video retrieval. We also analyze how this improvement is achieved, and show that text-only training reduces the modality gap between text and video embeddings, leading to better organization of the embedding space.
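
The abstract does not give the exact loss, so the following is only a minimal sketch of a contrastive (InfoNCE-style) objective with hard negatives, assuming cosine similarity and a temperature of 0.07. The function names, the temperature value, and the toy 2-D "embeddings" are all hypothetical and stand in for text embeddings produced by the model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_with_hard_negatives(anchor, positive, hard_negatives, temperature=0.07):
    """InfoNCE loss for one anchor (assumed form, not the paper's exact loss).

    Returns -log of the softmax probability of the positive, computed over
    the positive and all hard negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in hard_negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Toy vectors (hypothetical): the anchor could be "opening a door", the
# positive a paraphrase of it, and the hard negative "closing a door".
anchor = [1.0, 0.0]
paraphrase = [0.9, 0.1]
opposite = [0.0, 1.0]

well_separated = info_nce_with_hard_negatives(anchor, paraphrase, [opposite])
confused = info_nce_with_hard_negatives(anchor, opposite, [paraphrase])
print(well_separated < confused)
```

The loss is small when the anchor sits close to its positive and far from the hard negative, and large when the two are confused, which is the pressure that pushes near-identical but semantically opposite captions apart.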

Piyush Bagad, Andrew Zisserman • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text-to-Video Retrieval | MSR-VTT | -- | 406 |
| Text-to-Image Retrieval | COCO | -- | 156 |
| Composed Video Retrieval | WebVid-CoVR (test) | R@1: 53.1 | 79 |
| Verb recognition | Epic-Kitchens (EK) | Top-1 Acc: 6.1 | 22 |
| Text-to-Video Retrieval | Something-Something CiA-Retrieval v2 | mAP (Chiral): 85.1 | 16 |
| Video-to-Text retrieval | Something-Something CiA-Retrieval v2 | R@1 (Chiral): 84 | 16 |
| Text-to-Video Retrieval | ReversedInTime | Binary Accuracy: 71.6 | 11 |
| Video-to-Text retrieval | ReversedInTime | Binary Accuracy: 71.3 | 11 |
| Chiral Action Recognition | CiA | SSv2 Accuracy: 90.8 | 9 |
| Video Classification | MMEB Video Classification (Kinetics-700, SSv2, HMDB, UCF, Breakfast) v2 (test) | Classification Accuracy: 63.7 | 8 |
Showing 10 of 18 rows
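
Several rows report R@1 (recall at rank 1). As a rough illustration of how such a number is typically computed, the sketch below assumes a query-by-candidate similarity matrix in which query i's correct candidate is candidate i. That pairing convention, the function name `recall_at_k`, and the toy scores are assumptions for illustration, not details taken from the benchmarks.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose ground-truth candidate (assumed to share
    the query's index) appears among the k highest-scoring candidates."""
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy similarity matrix (hypothetical): rows are queries, columns are candidates.
sim = [
    [0.9, 0.1, 0.0],  # query 0 ranks its own candidate first
    [0.2, 0.3, 0.8],  # query 1 ranks candidate 2 above its own
    [0.1, 0.2, 0.7],  # query 2 ranks its own candidate first
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
print(r1, r2)
```

The function returns a fraction between 0 and 1; values such as those in the table are presumably the same quantity expressed as a percentage.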
