Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision
About
Modern self-supervised learning algorithms typically enforce persistency of instance representations across views. While highly effective for learning holistic image and video representations, such an objective becomes sub-optimal for learning spatio-temporally fine-grained features in videos, where scenes and instances evolve through space and time. In this paper, we present Contextualized Spatio-Temporal Contrastive Learning (ConST-CL) to effectively learn spatio-temporally fine-grained video representations via self-supervision. We first design a region-based pretext task which requires the model to transform instance representations from one view to another, guided by context features. Further, we introduce a simple network design that successfully reconciles the simultaneous learning process of both holistic and local representations. We evaluate our learned representations on a variety of downstream tasks and show that ConST-CL achieves competitive results on 6 datasets, including Kinetics, UCF, HMDB, AVA-Kinetics, AVA and OTB.
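The region-based pretext task described above can be sketched as a cross-view prediction followed by an InfoNCE-style contrastive loss. The `ContextTransform` module and its single-head cross-attention design below are illustrative assumptions, not the paper's exact architecture; the sketch only shows the overall shape of the objective.

```python
import torch
import torch.nn.functional as F

class ContextTransform(torch.nn.Module):
    """Hypothetical transform: predict view-B region features from
    view-A region features, guided by view-B context tokens."""
    def __init__(self, dim):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, regions, context):
        # Cross-attend from view-A region queries to view-B context keys/values.
        out, _ = self.attn(regions.unsqueeze(0), context.unsqueeze(0), context.unsqueeze(0))
        return self.proj(out.squeeze(0))

def region_contrastive_loss(feats_a, feats_b, ctx_b, transform, temperature=0.1):
    """Contrastive loss between transformed view-A regions and view-B regions.

    feats_a: (N, D) region features from view A
    feats_b: (N, D) aligned region features from view B (positives on diagonal)
    ctx_b:   (M, D) context features from view B guiding the transform
    """
    pred_b = F.normalize(transform(feats_a, ctx_b), dim=-1)
    targ_b = F.normalize(feats_b, dim=-1)
    logits = pred_b @ targ_b.t() / temperature   # (N, N) similarity matrix
    labels = torch.arange(feats_a.size(0))       # matching region index is the positive
    return F.cross_entropy(logits, labels)
```

In practice the holistic (clip-level) contrastive objective would be trained jointly with this local loss, as the paper's network design reconciles both.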
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Action Recognition | Kinetics-400 | Top-1 Acc: 66.6 | 184 |
| Action Recognition | UCF101 (3 splits) | Accuracy: 94.8 | 155 |
| Video Action Recognition | HMDB-51 (3 splits) | Accuracy: 71.9 | 116 |
| Single Object Tracking | OTB 2015 (val) | Precision: 78.1 | 8 |
| Video Action Recognition | UCF101 (train/val) | Top-1 Acc: 94.8 | 8 |
| Spatio-temporal Action Localization | AVA v2.2 (val) | mAP (Det): 24.1 | 7 |
| Video Action Recognition | HMDB51 (train/val) | Top-1 Acc: 71.9 | 7 |
| Video Action Recognition | Kinetics-400 (train/val) | Top-1 Acc: 66.6 | 7 |
| Spatio-temporal Action Localization | AVA-Kinetics v2.2 (val) | mAP (GT): 39.4 | 5 |