ViA: View-invariant Skeleton Action Representation Learning via Motion Retargeting

About

Current self-supervised approaches for skeleton action representation learning often focus on constrained scenarios, where videos and skeleton data are recorded in laboratory settings. When dealing with estimated skeleton data in real-world videos, such methods perform poorly due to the large variations across subjects and camera viewpoints. To address this issue, we introduce ViA, a novel View-Invariant Autoencoder for self-supervised skeleton action representation learning. ViA leverages motion retargeting between different human performers as a pretext task, in order to disentangle the latent action-specific 'Motion' features on top of the visual representation of a 2D or 3D skeleton sequence. Such 'Motion' features are invariant to skeleton geometry and camera view, and allow ViA to facilitate both cross-subject and cross-view action classification tasks. We conduct a study focusing on transfer learning for skeleton-based action recognition with self-supervised pre-training on real-world data (e.g., Posetics). Our results show that skeleton representations learned from ViA are generic enough to improve upon state-of-the-art action classification accuracy, not only on 3D laboratory datasets such as NTU-RGB+D 60 and NTU-RGB+D 120, but also on real-world datasets where only 2D data are accurately estimated, e.g., Toyota Smarthome, UAV-Human and Penn Action.
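
As a rough illustration of what motion retargeting between two performers means, the toy sketch below transfers the movement of one planar two-bone chain onto another with different limb lengths. It is not ViA's learned autoencoder: the hand-written angle and bone-length extraction only stands in for the 'Motion' and skeleton-geometry parts of the representation, and every name in it (`forward_kinematics`, `extract_motion`, `extract_character`, `retarget`, the example limbs and angle sequences) is a hypothetical choice made for this example.

```python
import math

def forward_kinematics(angles, bone_lengths):
    """Place the joints of a planar chain from absolute bone angles (radians)."""
    x, y = 0.0, 0.0
    joints = [(x, y)]
    for theta, length in zip(angles, bone_lengths):
        x += length * math.cos(theta)
        y += length * math.sin(theta)
        joints.append((x, y))
    return joints

def extract_motion(joints):
    """Per-bone angles: a stand-in for geometry-invariant 'Motion' features."""
    return [math.atan2(y1 - y0, x1 - x0)
            for (x0, y0), (x1, y1) in zip(joints, joints[1:])]

def extract_character(joints):
    """Bone lengths: a stand-in for the static skeleton geometry."""
    return [math.hypot(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(joints, joints[1:])]

def retarget(source_seq, target_seq):
    """Drive the target performer's skeleton with the source performer's motion."""
    target_bones = extract_character(target_seq[0])
    return [forward_kinematics(extract_motion(frame), target_bones)
            for frame in source_seq]

# Two hypothetical performers with different limb lengths (assumed values).
short_arm = [0.25, 0.20]
long_arm = [0.40, 0.35]

# Hypothetical per-frame bone angles for two different actions.
wave = [[0.0, 0.3], [0.2, 0.9], [0.4, 1.4]]
reach = [[1.0, 1.0], [0.8, 0.6], [0.5, 0.1]]

seq_a = [forward_kinematics(a, short_arm) for a in wave]
seq_b = [forward_kinematics(a, long_arm) for a in reach]

# Performer B's body performing performer A's action.
retargeted = retarget(seq_a, seq_b)
```

In this toy setting the angles extracted from `retargeted` match those of `seq_a` while the bone lengths match those of `seq_b`, which is the kind of separation the pretext task is meant to encourage in a learned representation.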

Di Yang, Yaohui Wang, Antitza Dantcheva, Lorenzo Garattoni, Gianpiero Francesca, Francois Bremond • 2022

Related benchmarks

Task	Dataset	Result
Action Recognition	NTU RGB+D 120 (X-set)	Accuracy 66.9	779
Action Recognition	NTU RGB+D 60 (Cross-View)	Accuracy 85.8	601
Action Recognition	NTU RGB+D 60 (X-sub)	Accuracy 89.6	496
Action Recognition	NTU RGB-D Cross-Subject 60	Accuracy 78.1	358
Action Recognition	NTU-60 (xsub)	Accuracy 89.6	271
Action Recognition	NTU RGB+D 120 Cross-Subject	Accuracy 69.2	249
Action Recognition	NTU-120 (cross-subject (xsub))	Accuracy 85	239
Action Recognition	NTU 120 (Cross-Setup)	Accuracy 86.5	231
Action Recognition	NTU RGB+D X-View 60	Accuracy 96.4	218
Action Recognition	NTU-60 (xview)	Accuracy 96.4	165

Showing 10 of 28 rows
