
Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision

About

Modern self-supervised learning algorithms typically enforce persistency of instance representations across views. While very effective for learning holistic image and video representations, such an objective becomes sub-optimal for learning spatio-temporally fine-grained features in videos, where scenes and instances evolve through space and time. In this paper, we present Contextualized Spatio-Temporal Contrastive Learning (ConST-CL) to effectively learn spatio-temporally fine-grained video representations via self-supervision. We first design a region-based pretext task which requires the model to transform instance representations from one view to another, guided by context features. Further, we introduce a simple network design that successfully reconciles the simultaneous learning process of both holistic and local representations. We evaluate our learned representations on a variety of downstream tasks and show that ConST-CL achieves competitive results on 6 datasets, including Kinetics, UCF, HMDB, AVA-Kinetics, AVA and OTB.
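The region-based pretext task described above can be illustrated with a minimal numpy sketch: region features from one view are transformed via attention over the other view's context features, and the transformed features are matched to the true target-view regions with an InfoNCE-style loss. All shapes, the single-head attention transform, and the function names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def attend(queries, context):
    # Single-head dot-product attention: transform view-A region
    # features using view-B context features as keys and values.
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

def region_contrastive_loss(regions_a, regions_b, context_b, tau=0.1):
    # Predict view-B region features from view-A regions plus view-B
    # context, then score predictions against the true view-B regions
    # with an InfoNCE objective (region i should match region i).
    pred = l2_normalize(attend(regions_a, context_b))
    target = l2_normalize(regions_b)
    logits = pred @ target.T / tau  # (R, R) similarity matrix
    log_prob = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(np.diag(log_prob))

R, C, D = 4, 6, 16  # regions, context tokens, feature dim (all illustrative)
regions_a = rng.normal(size=(R, D))
regions_b = rng.normal(size=(R, D))
context_b = rng.normal(size=(C, D))
loss = region_contrastive_loss(regions_a, regions_b, context_b)
print(loss)
```

In the paper this loss would be combined with a holistic (clip-level) contrastive term; the sketch shows only the local, region-level part of the objective.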

Liangzhe Yuan, Rui Qian, Yin Cui, Boqing Gong, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu • 2021

Related benchmarks

Task                                 Dataset                    Metric     Result  Rank
Video Action Recognition             Kinetics-400               Top-1 Acc  66.6    184
Action Recognition                   UCF101 (3 splits)          Accuracy   94.8    155
Video Action Recognition             HMDB-51 (3 splits)         Accuracy   71.9    116
Single Object Tracking               OTB 2015 (val)             Precision  78.1    8
Video Action Recognition             UCF101 (train/val)         Top-1 Acc  94.8    8
Spatio-temporal Action Localization  AVA v2.2 (val)             mAP (Det)  24.1    7
Video Action Recognition             HMDB51 (train/val)         Top-1 Acc  71.9    7
Video Action Recognition             Kinetics-400 (train/val)   Top-1 Acc  66.6    7
Spatio-temporal Action Localization  AVA-Kinetics v2.2 (val)    mAP (GT)   39.4    5

Other info

Code
