TempCLR: Temporal Alignment Representation with Contrastive Learning

About

Video representation learning has been successful in video-text pre-training for zero-shot transfer, where each sentence is trained to be close to the paired video clips in a common feature space. For long videos, given a paragraph of description where the sentences describe different segments of the video, by matching all sentence-clip pairs, the paragraph and the full video are aligned implicitly. However, such unit-level comparison may ignore global temporal context, which inevitably limits the generalization ability. In this paper, we propose a contrastive learning framework TempCLR to compare the full video and the paragraph explicitly. As the video/paragraph is formulated as a sequence of clips/sentences, under the constraint of their temporal order, we use dynamic time warping to compute the minimum cumulative cost over sentence-clip pairs as the sequence-level distance. To explore the temporal dynamics, we break the consistency of temporal succession by shuffling video clips w.r.t. temporal granularity. Then, we obtain the representations for clips/sentences, which perceive the temporal information and thus facilitate the sequence alignment. In addition to pre-training on the video and paragraph, our approach can also generalize on the matching between video instances. We evaluate our approach on video retrieval, action step localization, and few-shot action recognition, and achieve consistent performance gain over all three tasks. Detailed ablation studies are provided to justify the approach design.
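The sequence-level comparison described above can be illustrated with a small sketch. It assumes cosine distance as the sentence-clip cost, a plain dynamic time warping recurrence, and an InfoNCE-style loss in which a reordered clip sequence serves as the negative. The function names, the temperature value, the toy features, and the use of a simple reversal in place of a shuffle are all hypothetical choices made for illustration; this page does not state the paper's exact formulation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(sentences, clips):
    """Minimum cumulative cost over order-preserving sentence-clip pairs."""
    n, m = len(sentences), len(clips)
    inf = float("inf")
    # acc[i][j]: best cost of aligning the first i sentences with the first j clips.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = cosine_distance(sentences[i - 1], clips[j - 1])
            # Allowed moves: advance the sentence, advance the clip, or advance both.
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def contrastive_loss(pos_dist, neg_dists, temperature=0.1):
    """InfoNCE-style loss over sequence distances (assumed form, not the paper's)."""
    logits = [-pos_dist / temperature] + [-d / temperature for d in neg_dists]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum - logits[0]


# Toy features: three sentences and four clips in matching temporal order.
sentences = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
clips = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [-1.0, 0.1]]

# Reversing the clips stands in for a shuffle that breaks temporal succession.
shuffled_clips = clips[::-1]

pos_dist = dtw_distance(sentences, clips)
neg_dist = dtw_distance(sentences, shuffled_clips)
loss = contrastive_loss(pos_dist, [neg_dist])
print(pos_dist, neg_dist, loss)
```

In this toy setup the correctly ordered video has a smaller sequence-level distance to the paragraph than the reordered one, so the loss is small. Swapping the roles of the two distances makes the loss large, which is the signal a contrastive objective of this kind would use to make the features sensitive to temporal order.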

Yuncong Yang, Jiawei Ma, Shiyuan Huang, Long Chen, Xudong Lin, Guangxing Han, Shih-Fu Chang • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Video Retrieval | Youcook2 (test) | Recall@10 | 97.9 | 54 |
| Action Step Localization | CrossTask (test) | Recall | 52.5 | 32 |
| Action Segmentation | COIN | Frame Accuracy | 68.7 | 29 |
| Action Step Localization | CrossTask | Average Recall | 36.9 | 28 |
| Video Question Answering | MSRVTT (test) | Accuracy | 92.2 | 26 |
| Video-paragraph retrieval | YouCookII Background Removed (test) | R@1 | 84.9 | 12 |
| Video Retrieval (clip-caption) | YouCookII (evaluation) | Recall@1 | 23.3 | 11 |
| Action Recognition | Something-Something v2 | Accuracy (1-shot) | 47.8 | 7 |
| Full-video retrieval | YouCookII Background Removed (test) | R@1 | 83.5 | 7 |
| Video Retrieval (clip-caption) | DiDeMo (test) | R@1 | 17.7 | 7 |
Showing 10 of 11 rows
