VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
About
Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in numerous applications. However, the emphasis on brief summary texts during pre-training prevents CLIP from understanding long descriptions. This issue is particularly acute for videos, which often contain abundant detailed content. In this paper, we propose the VideoCLIP-XL (eXtra Length) model, which aims to unleash the long-description understanding capability of video CLIP models. First, we establish an automatic data collection system and gather a large-scale VILD pre-training dataset of VIdeo and Long-Description pairs. Then, we propose Text-similarity-guided Primary Component Matching (TPCM) to better learn the distribution of the feature space while expanding the long-description capability. We also introduce two new tasks, namely Detail-aware Description Ranking (DDR) and Hallucination-aware Description Ranking (HDR), to further improve understanding. Finally, we construct a Long Video Description Ranking (LVDR) benchmark to evaluate the long-description capability more comprehensively. Extensive experimental results on widely used text-video retrieval benchmarks with both short and long descriptions, and on our LVDR benchmark, demonstrate the effectiveness of our method.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Video Retrieval | MSRVTT | Recall@1 | 44.3 | 48 |
| Video Retrieval | CRB-G | R@1 | 82.8 | 18 |
| Video Retrieval | CRB-T | R@1 | 48.7 | 18 |
| Video Retrieval | CRB-S | R@1 | 83.9 | 18 |
| Video Retrieval | DiDeMo | R@1 | 40.3 | 18 |
| Video Retrieval | UVRB Average of 16 datasets | Average Score | 49.1 | 18 |
| Video Retrieval | VDC-O | R@1 | 73.5 | 18 |
| Video Retrieval | DREAM-E | R@1 | 26.3 | 18 |
| Image-to-Video Retrieval | MSRVTT I2V | Recall@1 | 86.1 | 18 |
| Video Retrieval | VDC-D | R@1 | 82 | 18 |
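
Most results in the table are Recall@1 (R@1) scores. As a rough illustration of what that metric measures, the sketch below computes Recall@K from a text-to-video similarity matrix. It is not the evaluation code behind these numbers: the function name `recall_at_k`, the toy matrix, and the assumption that text `i` is paired with video `i` are all hypothetical choices made for the example.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Percentage of text queries whose paired video ranks in the top k.

    Assumption (for illustration only): similarity[i][j] scores text i
    against video j, and the ground-truth match for text i is video i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score
        # strictly higher. Ties are therefore resolved in the query's favour.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 example: texts 0 and 2 retrieve their own video first,
# text 1 ranks its own video second.
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(toy_similarity, k=1)
r2 = recall_at_k(toy_similarity, k=2)
print(f"R@1 = {r1:.1f}, R@2 = {r2:.1f}")
```

In this toy case two of three queries succeed at K=1 and all three at K=2. Real benchmarks may differ in details such as tie handling or multiple captions per video, so treat this only as a reading aid for the metric column.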