Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval

About

In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon three main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations improve performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
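
To make the token-wise interaction and the sigmoid loss more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it shows the generic ColBERT-style MaxSim operator applied to two hypothetical sets of visual tokens (one per frame, one per patch) and a SigLIP-style binary sigmoid loss on each resulting score. The function names, the `temperature` and `bias` parameters, the toy vectors, and the choice to simply add the two losses are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_tokens, visual_tokens):
    """ColBERT-style late interaction: each query token keeps its best
    match among the visual tokens, and the per-token maxima are summed."""
    return sum(max(dot(q, v) for v in visual_tokens) for q in query_tokens)


def sigmoid_pair_loss(score, is_match, temperature=1.0, bias=0.0):
    """Binary sigmoid loss on one query-video score (assumed SigLIP-style form).

    Computes -log(sigmoid(z)) with z = label * (temperature * score + bias),
    written as softplus(-z) for numerical stability.
    """
    label = 1.0 if is_match else -1.0
    z = label * (temperature * score + bias)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy 2-d embeddings (hypothetical values, already unit-normalized).
query = [[1.0, 0.0], [0.0, 1.0]]
frame_tokens = [[1.0, 0.0], [0.6, 0.8]]   # one token per frame (assumption)
patch_tokens = [[0.0, 1.0], [0.8, 0.6]]   # spatial tokens (assumption)

frame_score = maxsim(query, frame_tokens)
patch_score = maxsim(query, patch_tokens)

# Assumed "dual" objective: one sigmoid loss per granularity, added together.
dual_loss = (sigmoid_pair_loss(frame_score, True)
             + sigmoid_pair_loss(patch_score, True))
```

Because each score is computed independently from precomputed query and video token embeddings, the video side can be indexed offline, which is the usual appeal of bi-encoder late interaction.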

Arun Reddy, Alexander Martin, Eugene Yang, Andrew Yates, Kate Sanders, Kenton Murray, Reno Kriz, Celso M. de Melo, Benjamin Van Durme, Rama Chellappa • 2025

Related benchmarks

Task	Dataset	Metric	Result
Text-to-Video Retrieval	DiDeMo	R@1	0.445	472
Text-to-Video Retrieval	MSVD	R@1	47.7	297
Text-to-Video Retrieval	VATEX	R@1	61.1	134
Video-to-Text Retrieval	MSRVTT	R@5	74.4	45
Text-to-Video Retrieval	MSRVTT	Recall@1	50	38
Video Retrieval	MSR-VTT	R@1	51.5	34
Video Retrieval	MULTIVENT 2.0 (test)	Recall@10	34.1	12
Article Generation	WikiVideo (test)	InfoP Score	83.8	10
Multimodal Retrieval	WikiVideo (test)	Alpha-nDCG	43.1	10
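
For readers unfamiliar with the R@K / Recall@K metrics in the table, the following is a small sketch of how such a number is commonly computed. It assumes exactly one relevant video per query, which may not hold for every benchmark listed, and the function name and toy IDs are hypothetical.

```python
def recall_at_k(ranked_ids_per_query, relevant_id_per_query, k):
    """Fraction of queries whose single relevant item appears in the top k.

    ranked_ids_per_query: list of ranked candidate ID lists, best first.
    relevant_id_per_query: list with the one relevant ID for each query.
    """
    if not ranked_ids_per_query:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
        if relevant in ranked[:k]
    )
    return hits / len(ranked_ids_per_query)


# Toy example with three queries (hypothetical video IDs).
rankings = [
    ["v1", "v7", "v3"],   # relevant "v1" is ranked first
    ["v4", "v2", "v9"],   # relevant "v2" is ranked second
    ["v5", "v6", "v8"],   # relevant "v0" is not retrieved
]
relevant = ["v1", "v2", "v0"]

r_at_1 = recall_at_k(rankings, relevant, k=1)
r_at_5 = recall_at_k(rankings, relevant, k=5)
```

Results are often reported as percentages, so a value from this function would typically be multiplied by 100.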
