Representation Learning via Global Temporal Alignment and Cycle-Consistency

About

We introduce a weakly supervised method for representation learning based on aligning temporal sequences (e.g., videos) of the same process (e.g., a human action). The main idea is to use the global temporal ordering of latent correspondences across sequence pairs as a supervisory signal. In particular, we propose a loss based on scoring the optimal sequence alignment to train an embedding network. Our loss is based on a novel probabilistic path-finding view of dynamic time warping (DTW) that has three key features: (i) the local path routing decisions are contrastive and differentiable, (ii) pairwise distances are cast as probabilities that are contrastive as well, and (iii) the formulation naturally admits a global cycle-consistency loss that verifies correspondences. For evaluation, we consider the tasks of fine-grained action classification, few-shot learning, and video synchronization, and report significant performance increases over previous methods. In addition, we report two applications of our temporal alignment framework, namely 3D pose reconstruction and fine-grained audio/visual retrieval.
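
To make the first two features above concrete, here is a minimal NumPy sketch of a differentiable alignment score. It is not the authors' loss: it swaps in a generic soft-DTW-style recursion (a smooth minimum in place of the hard minimum) and a row-wise softmax over negative squared distances as a stand-in for "distances cast as contrastive probabilities". The function names, the `temperature` and `gamma` parameters, and their default values are all assumptions made for illustration.

```python
import numpy as np

def pairwise_probabilities(x, y, temperature=0.1):
    """Row-wise softmax over negative squared distances (assumed form).

    Each frame in x gets a probability distribution over the frames of y,
    so a match is scored relative to all other candidate matches.
    """
    d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    logits = -d / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def smooth_min(values, gamma):
    """Differentiable surrogate for min: -gamma * log(sum(exp(-v / gamma)))."""
    v = np.asarray(values, dtype=float) / -gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))

def soft_alignment_cost(cost, gamma=0.1):
    """DTW recursion with the hard min replaced by smooth_min."""
    n, m = cost.shape
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = [r[i - 1, j - 1], r[i - 1, j], r[i, j - 1]]
            r[i, j] = cost[i - 1, j - 1] + smooth_min(prev, gamma)
    return r[n, m]

def alignment_loss(x, y, temperature=0.1, gamma=0.1):
    """Score the (soft) optimal alignment using -log probability as cost."""
    p = pairwise_probabilities(x, y, temperature)
    return soft_alignment_cost(-np.log(p + 1e-12), gamma)

# Toy embeddings: six frames moving along a line.
frames = np.linspace(0.0, 1.0, 6)[:, None]
loss_ordered = alignment_loss(frames, frames)
loss_reversed = alignment_loss(frames, frames[::-1])
```

In this toy setup, a sequence aligned with itself should score a lower cost than the same sequence aligned with its time-reversed copy, which is the kind of global ordering signal the abstract describes. In practice the embeddings would come from a trained network and the recursion would be written in an autodiff framework.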

Isma Hadji, Konstantinos G. Derpanis, Allan D. Jepson • 2021

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Audio-Visual Event Localization | AVE | -- | 35 |
| Video Alignment | Penn-Action | Kendall's Tau: 0.7829 | 33 |
| Action phase classification | Break Eggs | F1 Score: 58.35 | 27 |
| Frame retrieval | Break Eggs | mAP@10: 61.55 | 27 |
| Frame retrieval | Tennis Forehand | mAP@10: 0.852 | 21 |
| Action phase classification | Pour Liquid | F1 Score: 59.96 | 21 |
| Frame retrieval | Pour Liquid | mAP@10: 62.79 | 21 |
| Action phase classification | Pour Milk | F1 Score: 81.51 | 21 |
| Action phase classification | Tennis Forehand | F1 Score: 83.63 | 21 |
| Frame retrieval | Pour Milk | mAP@10: 80.12 | 21 |
Showing 10 of 39 rows
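
For readers unfamiliar with the Video Alignment metric, the sketch below shows one common way Kendall's Tau is computed for a pair of videos in alignment work: each frame of the first video is matched to its nearest neighbour in the second, and the score measures how often pairs of matches preserve temporal order. The exact evaluation protocol behind the figure listed here is not stated on this page, so treat this as an assumption; the function name and toy inputs are illustrative only.

```python
import numpy as np

def alignment_kendalls_tau(emb_a, emb_b):
    """Kendall's Tau over nearest-neighbour matches from video A to video B.

    emb_a, emb_b: arrays of shape (num_frames, dim) holding frame embeddings.
    Returns a value in [-1, 1]; 1 means every pair of matches is in order.
    """
    d = ((emb_a[:, None, :] - emb_b[None, :, :]) ** 2).sum(-1)
    nearest = d.argmin(axis=1)  # index in B of the closest frame for each frame in A
    n = len(nearest)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if nearest[i] < nearest[j]:
                concordant += 1
            elif nearest[i] > nearest[j]:
                discordant += 1
            # ties (same nearest frame) count as neither
    return (concordant - discordant) / (n * (n - 1) / 2)

# Toy check: five frames along a line, matched to themselves and to a reversed copy.
video = np.arange(5, dtype=float)[:, None]
tau_same = alignment_kendalls_tau(video, video)
tau_reversed = alignment_kendalls_tau(video, video[::-1])
```

A perfectly ordered match gives 1 and a fully reversed one gives -1; benchmark scores are typically averaged over many video pairs.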
