Adapting MLLMs for Nuanced Video Retrieval

About

Our objective is to build an embedding model that captures the nuanced relationship between a search query and candidate videos. We cover three aspects of nuanced retrieval: (i) temporal, (ii) negation, and (iii) multimodal. For temporal nuance, we consider chiral actions, which require distinguishing between temporally opposite actions such as "opening a door" vs. "closing a door". For negation, we consider queries with negators such as "not" and "none" that allow users to specify what they do not want. For multimodal nuance, we consider the task of composed retrieval, where the query comprises a video along with a text edit instruction. The goal is to develop a unified embedding model that handles such nuances effectively. To that end, we repurpose a Multimodal Large Language Model (MLLM) trained to generate text into an embedding model. We fine-tune it with a contrastive loss on text alone, using carefully sampled hard negatives that instill the desired nuances in the learned embedding space. Despite the text-only training, our method achieves state-of-the-art performance on all benchmarks for nuanced video retrieval. We also analyze how this improvement is achieved, and show that text-only training reduces the modality gap between text and video embeddings, leading to better organization of the embedding space.
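The two ideas named in the abstract, a contrastive loss with hard negatives and the modality gap, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact loss, so an InfoNCE-style form is assumed, and the temperature value, the function names, and the centroid-distance definition of the modality gap are all illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(anchor, positive, hard_negatives, temperature=0.07):
    """InfoNCE-style loss for one anchor (assumed form, not from the paper).

    Returns -log softmax of the positive's similarity among the positive
    and all hard negatives. The temperature is a hypothetical default.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in hard_negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]


def modality_gap(text_embs, video_embs):
    """Distance between centroids of L2-normalized embeddings.

    One common way to quantify a modality gap; assumed here for illustration.
    """
    def centroid(embs):
        normed = [l2_normalize(e) for e in embs]
        return [sum(col) / len(normed) for col in zip(*normed)]

    return math.dist(centroid(text_embs), centroid(video_embs))


# Toy 2-D embeddings standing in for text embeddings of
# "opening a door" (anchor), a paraphrase (positive), and negatives.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
hard_negative = [0.8, 0.6]   # close to the anchor, e.g. "closing a door"
easy_negative = [-1.0, 0.0]  # far from the anchor

loss_hard = contrastive_loss(anchor, positive, [hard_negative])
loss_easy = contrastive_loss(anchor, positive, [easy_negative])
```

In this toy setup the hard negative yields a larger loss than the easy one, which is why sampling negatives that differ only in the targeted nuance gives a stronger training signal.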

Piyush Bagad, Andrew Zisserman • 2025

Related benchmarks

Task	Dataset	Result
Text-to-Video Retrieval	MSR-VTT	--	406
Text-to-Image Retrieval	COCO	--	156
Composed Video Retrieval	WebVid-CoVR (test)	R@1 53.1	79
Verb recognition	Epic-Kitchens (EK)	Top-1 Acc 6.1	22
Text-to-Video Retrieval	Something-Something CiA-Retrieval v2	mAP (Chiral) 85.1	16
Video-to-Text Retrieval	Something-Something CiA-Retrieval v2	R@1 (Chiral) 84	16
Text-to-Video Retrieval	ReversedInTime	Binary Accuracy 71.6	11
Video-to-Text Retrieval	ReversedInTime	Binary Accuracy 71.3	11
Chiral Action Recognition	CiA	SSv2 Accuracy 90.8	9
Video Classification	MMEB Video Classification (Kinetics-700, SSv2, HMDB, UCF, Breakfast) v2 (test)	Classification Accuracy 63.7	8

Showing 10 of 18 rows
