
A CLIP-Hitchhiker's Guide to Long Video Retrieval

About

Our goal in this paper is the adaptation of image-text models for long video retrieval. Recent works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP, effectively hitchhiking on the image-text representation for video tasks. However, there has been limited success in learning a temporal aggregation that outperforms mean-pooling the image-level representations extracted per frame by CLIP. We find that the simple yet effective baseline of a weighted mean of frame embeddings via query-scoring is a significant improvement over mean-pooling and all prior temporal modelling attempts. In doing so, we provide an improved baseline for others to compare against and demonstrate state-of-the-art performance of this simple baseline on a suite of long video retrieval benchmarks.
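The query-scoring baseline described above can be sketched as follows: embed each frame with CLIP's image encoder, score every frame against the CLIP text embedding of the query, and take a softmax-weighted mean of the frame embeddings. This is a minimal numpy sketch, not the authors' released code; the function name, the softmax weighting, and the temperature value are illustrative assumptions.

```python
import numpy as np

def softmax(x, temperature=0.07):
    # Numerically stable softmax with a temperature (value is an assumption).
    z = x / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def query_scored_video_embedding(frame_embs, query_emb, temperature=0.07):
    """Weighted mean of frame embeddings, weighted by each frame's
    similarity to the text query (the query-scoring baseline).

    frame_embs: (num_frames, dim) L2-normalised CLIP image embeddings.
    query_emb:  (dim,) L2-normalised CLIP text embedding.
    """
    sims = frame_embs @ query_emb         # per-frame cosine similarity to the query
    weights = softmax(sims, temperature)  # query-dependent frame weights
    video_emb = weights @ frame_embs      # weighted mean over frames
    return video_emb / np.linalg.norm(video_emb)
```

Retrieval then ranks videos by the cosine similarity between `query_emb` and each video's `query_scored_video_embedding`; as the temperature grows, the weights become uniform and the method reduces to mean-pooling.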

Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Video-to-Text Retrieval | MSR-VTT | Recall@1 | 47.7 | 157 |
| Action Recognition | Charades | mAP | 0.449 | 64 |
| Text-to-Video Retrieval | MSR-VTT 1k-A (test) | R@1 | 47.7 | 57 |
| Video Classification | Charades | mAP | 21.1 | 38 |
| Text-to-Video Retrieval | ActivityNet (val1) | R@1 | 44 | 28 |
| Text-to-Video Retrieval | MSR-VTT 1k-A (test) | R@1 | 47.7 | 23 |
| Text-to-Video Retrieval | ActivityNet Captions (val) | R@1 | 44 | 11 |
| Detour Video Retrieval | Detours (test) | R@5 | 8.4 | 10 |
| Multi-Label Classification | Charades | mAP | 44.9 | 8 |
| Text-to-Video Retrieval | MSR-VTT full 7k train (test) | R@1 | 34.9 | 6 |

Showing 10 of 11 rows.
