Spatial-then-Temporal Self-Supervised Learning for Video Correspondence

About

In low-level video analyses, effective representations are important to derive the correspondences between video frames. These representations have been learned in a self-supervised fashion from unlabeled images or videos, using carefully designed pretext tasks in some recent studies. However, the previous work concentrates on either spatial-discriminative features or temporal-repetitive features, with little attention to the synergy between spatial and temporal cues. To address this issue, we propose a spatial-then-temporal self-supervised learning method. Specifically, we firstly extract spatial features from unlabeled images via contrastive learning, and secondly enhance the features by exploiting the temporal cues in unlabeled videos via reconstructive learning. In the second step, we design a global correlation distillation loss to ensure the learning not to forget the spatial cues, and a local correlation distillation loss to combat the temporal discontinuity that harms the reconstruction. The proposed method outperforms the state-of-the-art self-supervised methods, as established by the experimental results on a series of correspondence-based video analysis tasks. Also, we performed ablation studies to verify the effectiveness of the two-step design as well as the distillation losses.

Rui Li, Dong Liu• 2022

Related benchmarks

Task	Dataset	Result
Video Object Segmentation	DAVIS 2017 (val)	J mean71.1	1226
Human Pose Tracking	JHMDB (val)	PCK@.163.1	15
Human Part Propagation	VIP (val)	mIoU41	12

Showing 3 of 3 rows

Other info

Code

Follow for update

@wizwand_team Discord