Learning by Aligning Videos in Time

About

We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task, while exploiting both frame-level and video-level information. We leverage a novel combination of a temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network. Specifically, the temporal alignment loss (i.e., Soft-DTW) seeks the minimum cost for temporally aligning videos in the embedding space. However, optimizing solely for this term leads to trivial solutions, in particular one where all frames are mapped to a small cluster in the embedding space. To overcome this problem, we propose a temporal regularization term (i.e., Contrastive-IDM) which encourages different frames to be mapped to different points in the embedding space. Extensive evaluations on various tasks, including action phase classification, action phase progression, and fine-grained frame retrieval, on three datasets, namely Pouring, Penn Action, and IKEA ASM, show superior performance of our approach over state-of-the-art methods for self-supervised representation learning from videos. In addition, our method provides a significant performance gain when labeled data is scarce. Our code and labels are available on our research website: https://retrocausal.ai/research/
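To make the alignment term concrete, here is a minimal sketch of a Soft-DTW cost between two sequences of frame embeddings. It is not the authors' implementation. It follows the standard soft-minimum recursion for Soft-DTW, and it assumes a squared Euclidean frame distance and a smoothing parameter `gamma`, neither of which is specified in the abstract. The function names `soft_min`, `sq_dist`, and `soft_dtw` are illustrative.

```python
import math

def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum_k exp(-v_k / gamma))."""
    finite = [v for v in values if math.isfinite(v)]
    lo = min(finite)  # shift by the true minimum for numerical stability
    return lo - gamma * math.log(sum(math.exp(-(v - lo) / gamma) for v in finite))

def sq_dist(a, b):
    """Squared Euclidean distance between two frame embeddings (assumed metric)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW alignment cost between embedding sequences x and y."""
    n, m = len(x), len(y)
    # r[i][j] holds the soft alignment cost of x[:i] against y[:j].
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = [r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]]
            r[i][j] = sq_dist(x[i - 1], y[j - 1]) + soft_min(prev, gamma)
    return r[n][m]

a = [[0.0], [1.0], [2.0]]
b = [[0.0], [0.0], [1.0], [2.0]]  # same motion, first frame held longer
c = [[5.0], [6.0], [7.0]]         # different content
collapsed = [[0.0]] * 3           # every frame mapped to one point

print(soft_dtw(a, b), soft_dtw(a, c), soft_dtw(collapsed, collapsed))
```

In this sketch the collapsed sequence attains a cost no greater than zero regardless of the video content, which illustrates the trivial solution that the regularization term is meant to rule out.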

Sanjay Haresh, Sateesh Kumar, Huseyin Coskun, Shahram Najam Syed, Andrey Konin, Muhammad Zeeshan Zia, Quoc-Huy Tran • 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action phase classification | Penn-Action | Phase Classification Accuracy | 84.25 | 48 |
| Phase classification | Penn-Action (test) | Accuracy | 84.47 | 45 |
| Video Alignment | Penn-Action | Kendall's Tau | 0.8047 | 33 |
| Action phase classification | COIN | Phase Acc | 39.81 | 32 |
| Action phase classification | Pouring | Phase Classification Accuracy | 92.84 | 24 |
| Phase classification | Pouring | Accuracy | 93.07 | 24 |
| Phase classification | IKEA ASM | Accuracy | 30.51 | 24 |
| Action phase classification | IKEA ASM No Background | Phase Classification Accuracy | 30.43 | 24 |
| Action phase classification | IKEA ASM (Background) | Accuracy (Phase Classification) | 25.54 | 24 |
| Temporal Alignment | Penn Action 11 actions (test) | Kendall's Tau | 0.8149 | 9 |
Showing 10 of 20 rows
