ViewCLR: Learning Self-supervised Video Representation for Unseen Viewpoints

About

Self-supervised video representation learning predominantly focuses on discriminating instances generated by simple data augmentation schemes. However, the learned representation often fails to generalize to unseen camera viewpoints. To this end, we propose ViewCLR, which learns a self-supervised video representation that is invariant to camera viewpoint changes. We introduce a view-generator, which can be regarded as a learnable augmentation for any self-supervised pretext task, to generate a latent viewpoint representation of a video. ViewCLR maximizes the similarity between the latent viewpoint representation and the representation from the original viewpoint, enabling the learned video encoder to generalize to unseen camera viewpoints. Experiments on cross-view benchmark datasets, including NTU RGB+D, show that ViewCLR is a state-of-the-art viewpoint-invariant self-supervised method.
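
The similarity-maximization idea in the abstract can be illustrated with a generic contrastive objective. The sketch below is an assumption made for illustration, not the authors' loss: it uses a standard InfoNCE-style formulation in which each original-viewpoint embedding is pulled toward its own latent-viewpoint embedding and pushed away from the other videos in the batch. The function names (`l2_normalize`, `dot`, `info_nce`), the temperature value, and the toy embeddings are all hypothetical, and the view-generator itself is not modeled.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(original, latent_view, temperature=0.1):
    """Mean InfoNCE loss over a batch of embedding pairs.

    original[i] and latent_view[i] are treated as the positive pair for
    video i; every other latent_view[j] in the batch acts as a negative.
    This is a generic contrastive loss assumed for illustration only.
    """
    q = [l2_normalize(v) for v in original]
    k = [l2_normalize(v) for v in latent_view]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarities scaled by the temperature.
        logits = [dot(qi, kj) / temperature for kj in k]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching pair.
        total += log_denominator - logits[i]
    return total / len(q)


if __name__ == "__main__":
    original = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[0.9, 0.1], [0.1, 0.9]]   # latent views close to their originals
    swapped = [[0.1, 0.9], [0.9, 0.1]]   # latent views matched to the wrong video
    print("aligned loss:", info_nce(original, aligned))
    print("swapped loss:", info_nce(original, swapped))
```

In this toy setup the aligned pairs give a lower loss than the swapped pairs, which is the behaviour a similarity-maximizing objective is meant to reward. In practice the embeddings would come from a video encoder and a learned view-generator rather than hand-written vectors.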

Srijan Das, Michael S. Ryoo • 2021

Related benchmarks

Task	Dataset	Result
Action Recognition	NTU RGB+D 120 (X-set)	Accuracy 86.2	770
Action Recognition	NTU RGB+D (Cross-View)	Accuracy 94.1	652
Action Recognition	NTU RGB+D 60 (Cross-View)	Accuracy 94.1	601
Action Recognition	NTU RGB+D (Cross-subject)	Accuracy 89.7	500
Action Recognition	NTU RGB+D 60 (X-sub)	Accuracy 89.7	496
Action Recognition	NTU RGB+D X-sub 120	Accuracy 84.5	473
Action Recognition	NTU RGB+D 120 Cross-Subject	Accuracy 84.5	241
Action Recognition	NTU 120 (Cross-Setup)	Accuracy 84.5	231
Action Recognition	NTU120 (cross-subject (CS))	Top-1 Accuracy 86.2	36
Action Recognition	NTU-60 (Cross-Subject (CS))	Top-1 Accuracy 89.7	31

