Adapting MLLMs for Nuanced Video Retrieval
About
Our objective is to build an embedding model that captures the nuanced relationship between a search query and candidate videos. We cover three aspects of nuanced retrieval: (i) temporal, (ii) negation, and (iii) multimodal. For temporal nuance, we consider chiral actions, which require distinguishing temporally opposite actions such as "opening a door" vs. "closing a door". For negation, we consider queries with negators such as "not" and "none", which let users specify what they do not want. For multimodal nuance, we consider the task of composed retrieval, where the query comprises a video along with a text edit instruction. The goal is to develop a unified embedding model that handles such nuances effectively. To that end, we repurpose a Multimodal Large Language Model (MLLM) trained to generate text into an embedding model. We fine-tune it with a contrastive loss on text alone, using carefully sampled hard negatives that instill the desired nuances in the learned embedding space. Despite the text-only training, our method achieves state-of-the-art performance on all benchmarks for nuanced video retrieval. We also analyze how this improvement is achieved, and show that text-only training reduces the modality gap between text and video embeddings, leading to better organization of the embedding space.
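To make the training signal concrete, the following is a minimal sketch of a contrastive (InfoNCE-style) loss with explicit hard negatives. It is not the authors' implementation: the function names, the temperature value, and the toy 2-D vectors are assumptions for illustration, and in practice the embeddings would come from the MLLM. Each anchor text is scored against every positive in the batch plus its own hard negatives (for example, a caption with the action reversed or a negator inserted).

```python
import math


def normalize(v):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(anchors, positives, hard_negatives, temperature=0.07):
    """InfoNCE-style loss over in-batch positives plus per-anchor hard negatives.

    anchors, positives: lists of B vectors; hard_negatives: B lists of K vectors.
    The temperature of 0.07 is an assumed default, not a value from the paper.
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    n = [[normalize(v) for v in negs] for negs in hard_negatives]

    total = 0.0
    for i, anchor in enumerate(a):
        # Candidates: every positive in the batch, then this anchor's hard negatives.
        logits = [dot(anchor, pj) / temperature for pj in p]
        logits += [dot(anchor, nk) / temperature for nk in n[i]]
        # Numerically stable log-sum-exp; the target is the anchor's own positive.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(a)


# Toy example: positives align with anchors, hard negatives point the opposite way.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[[-1.0, 0.0]], [[0.0, -1.0]]]
loss = contrastive_loss(anchors, positives, hard_negatives)
```

The loss is small when each anchor sits closer to its positive than to any hard negative, and grows when a hard negative scores higher, which is the pressure that pushes pairs like "opening a door" and "closing a door" apart in the embedding space.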
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Video Retrieval | MSR-VTT | -- | -- | 406 |
| Text-to-Image Retrieval | COCO | -- | -- | 156 |
| Composed Video Retrieval | WebVid-CoVR (test) | R@1 | 53.1 | 79 |
| Verb Recognition | Epic-Kitchens (EK) | Top-1 Acc | 6.1 | 22 |
| Text-to-Video Retrieval | Something-Something CiA-Retrieval v2 | mAP (Chiral) | 85.1 | 16 |
| Video-to-Text Retrieval | Something-Something CiA-Retrieval v2 | R@1 (Chiral) | 84 | 16 |
| Text-to-Video Retrieval | ReversedInTime | Binary Accuracy | 71.6 | 11 |
| Video-to-Text Retrieval | ReversedInTime | Binary Accuracy | 71.3 | 11 |
| Chiral Action Recognition | CiA | SSv2 Accuracy | 90.8 | 9 |
| Video Classification | MMEB Video Classification (Kinetics-700, SSv2, HMDB, UCF, Breakfast) v2 (test) | Classification Accuracy | 63.7 | 8 |