
Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning

About

Prior work on action representation learning mainly focuses on designing architectures that extract global representations for short video clips. In contrast, many practical applications, such as video alignment, have a strong demand for dense representations of long videos. In this paper, we introduce a novel contrastive action representation learning (CARL) framework to learn frame-wise action representations, especially for long videos, in a self-supervised manner. Concretely, we introduce a simple yet efficient video encoder that considers spatio-temporal context to extract frame-wise representations. Inspired by recent progress in self-supervised learning, we present a novel sequence contrastive loss (SCL) applied to two correlated views obtained through a series of spatio-temporal data augmentations. SCL optimizes the embedding space by minimizing the KL-divergence between the sequence similarity of the two augmented views and a prior Gaussian distribution of timestamp distance. Experiments on the FineGym, PennAction and Pouring datasets show that our method outperforms the previous state of the art by a large margin on downstream fine-grained action classification. Surprisingly, even without training on paired videos, our approach also performs strongly on video alignment and fine-grained frame retrieval tasks. Code and models are available at https://github.com/minghchen/CARL_code.
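
The sequence contrastive loss described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It only shows the general idea for one direction (view 1 to view 2), under several assumptions: cosine similarity with a softmax over the frames of the other view, and hypothetical hyperparameters `sigma` (width of the Gaussian prior) and `tau` (softmax temperature). The function and argument names are illustrative.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def sequence_contrastive_loss(z1, z2, t1, t2, sigma=1.0, tau=0.1):
    """KL(prior || predicted), averaged over the frames of view 1.

    z1: (T1, D) frame embeddings of view 1
    z2: (T2, D) frame embeddings of view 2
    t1: (T1,)  timestamps of the view-1 frames in the original video
    t2: (T2,)  timestamps of the view-2 frames in the original video
    sigma, tau: assumed hyperparameters, not taken from the paper
    """
    # Cosine similarity between every frame of view 1 and every frame of view 2.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    log_q = log_softmax(z1 @ z2.T / tau, axis=1)  # predicted distribution (log)

    # Gaussian prior over timestamp distance, normalized per view-1 frame.
    dist2 = (t1[:, None] - t2[None, :]) ** 2
    log_p = log_softmax(-dist2 / (2.0 * sigma ** 2), axis=1)
    p = np.exp(log_p)

    # KL divergence per view-1 frame, then the mean over frames.
    kl = (p * (log_p - log_q)).sum(axis=1)
    return float(kl.mean())


# Toy usage: two views of a 5-frame sequence with one-hot "embeddings".
t = np.arange(5, dtype=float)
aligned = sequence_contrastive_loss(np.eye(5), np.eye(5), t, t)
reversed_view = sequence_contrastive_loss(np.eye(5), np.eye(5)[::-1], t, t)
```

In this toy setup the temporally aligned pair yields a lower loss than the pair whose second view is reversed in time, which is the behavior the Gaussian prior is meant to encourage. A symmetric variant would average this value with the same quantity computed from view 2 to view 1.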

Minghao Chen, Fangyun Wei, Chong Li, Deng Cai • 2022

Related benchmarks

Task | Dataset | Metric | Result | Rank
Action Recognition | FineGYM | Accuracy | 41.8 | 29
Action phase classification | Break Eggs | F1 Score | 43.43 | 27
Frame retrieval | Break Eggs | mAP@10 | 46.04 | 27
Action phase classification | Pour Liquid | F1 Score | 56.98 | 21
Frame retrieval | Pour Liquid | mAP@10 | 59.37 | 21
Action phase classification | Pour Milk | F1 Score | 52.41 | 21
Action phase classification | Tennis Forehand | F1 Score | 59.69 | 21
Frame retrieval | Tennis Forehand | mAP@10 | 0.6943 | 21
Frame retrieval | Pour Milk | mAP@10 | 55.01 | 21
Exo2ego Frame Retrieval | CMU-MMAC | mAP@5 | 40.37 | 10
Showing 10 of 29 rows
