A CLIP-Hitchhiker's Guide to Long Video Retrieval
About
Our goal in this paper is the adaptation of image-text models for long video retrieval. Recent works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP, effectively hitchhiking on the image-text representation for video tasks. However, there has been limited success in learning temporal aggregation methods that outperform mean-pooling the image-level representations CLIP extracts per frame. We find that the simple yet effective baseline of a query-scored weighted mean of frame embeddings is a significant improvement over mean-pooling and all prior temporal modelling attempts. In doing so, we provide an improved baseline for others to compare against, and demonstrate state-of-the-art performance of this simple baseline on a suite of long video retrieval benchmarks.
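The query-scoring baseline amounts to a softmax-weighted mean of per-frame CLIP embeddings, where each frame's weight is its similarity to the text query. Below is a minimal PyTorch sketch of this idea; the function name and the softmax temperature value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def query_scored_video_embedding(frame_embeds: torch.Tensor,
                                 text_embed: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Weighted mean of per-frame CLIP embeddings, weighted by query similarity.

    frame_embeds: (num_frames, dim) L2-normalised CLIP image embeddings.
    text_embed:   (dim,)            L2-normalised CLIP text embedding.
    temperature:  softmax temperature (an assumed value, not from the paper).
    """
    # Per-frame relevance scores: cosine similarity of each frame to the query.
    scores = frame_embeds @ text_embed                 # (num_frames,)
    # Softmax over frames turns the scores into aggregation weights.
    weights = F.softmax(scores / temperature, dim=0)   # (num_frames,)
    # Query-conditioned weighted mean of the frame embeddings.
    video_embed = (weights.unsqueeze(-1) * frame_embeds).sum(dim=0)
    return F.normalize(video_embed, dim=0)
```

At retrieval time, a video's score for a query would then be the cosine similarity between this query-conditioned video embedding and the same text embedding used for scoring.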
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video-to-Text Retrieval | MSR-VTT | Recall@1 | 47.7 | 157 |
| Action Recognition | Charades | mAP | 0.449 | 64 |
| Text-to-Video Retrieval | MSR-VTT 1k-A (test) | R@1 | 47.7 | 57 |
| Video Classification | Charades | mAP | 21.1 | 38 |
| Text-to-Video Retrieval | ActivityNet (val1) | R@1 | 44 | 28 |
| Text-to-Video Retrieval | MSR-VTT 1k-A (test) | R@1 | 47.7 | 23 |
| Text-to-Video Retrieval | ActivityNet Captions (val) | R@1 | 44 | 11 |
| Detour Video Retrieval | Detours (test) | R@5 | 8.4 | 10 |
| Multi-Label Classification | Charades | mAP | 44.9 | 8 |
| Text-to-Video Retrieval | MSR-VTT full, 7k train (test) | R@1 | 34.9 | 6 |