Skeleton-Contrastive 3D Action Representation Learning
About
This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition. Our proposal is built upon learning invariances to input skeleton representations and various skeleton augmentations via noise contrastive estimation. In particular, we propose inter-skeleton contrastive learning, which learns from multiple different input skeleton representations in a cross-contrastive manner. In addition, we contribute several skeleton-specific spatial and temporal augmentations which further encourage the model to learn the spatio-temporal dynamics of skeleton data. By learning similarities between different skeleton representations as well as augmented views of the same sequence, the network is encouraged to learn higher-level semantics of the skeleton data than when only using the augmented views. Our approach achieves state-of-the-art performance for self-supervised learning from skeleton data on the challenging PKU and NTU datasets across multiple downstream tasks, including action recognition, action retrieval and semi-supervised learning. Code is available at https://github.com/fmthoker/skeleton-contrast.
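The cross-contrastive idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes each sequence has already been embedded twice, once per skeleton representation, and it uses an InfoNCE-style loss with in-batch negatives. The function names (`info_nce`, `cross_contrastive_loss`), the temperature of 0.07, and the use of in-batch negatives are all choices made for this illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss over a batch.

    queries[i] and keys[i] embed the same sequence (positive pair);
    keys[j] with j != i act as negatives. In-batch negatives are an
    assumption of this sketch, not something stated in the abstract.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of this query to every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the positive key.
        total += log_denom - logits[i]
    return total / len(q)


def cross_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric loss: representation A queries B's keys, and vice versa."""
    return 0.5 * (info_nce(emb_a, emb_b, temperature)
                  + info_nce(emb_b, emb_a, temperature))


# Toy embeddings of two sequences under two hypothetical representations.
emb_a = [[1.0, 0.0], [0.0, 1.0]]
emb_b = [[0.9, 0.1], [0.1, 0.9]]
aligned = cross_contrastive_loss(emb_a, emb_b)
misaligned = cross_contrastive_loss(emb_a, [emb_b[1], emb_b[0]])
print(aligned < misaligned)
```

The loss is small when embeddings of the same sequence agree across representations and large when they do not, which is the pressure that pushes the two encoders toward a shared, representation-invariant feature space.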
Related benchmarks
| Task | Dataset | Accuracy | Rank |
|---|---|---|---|
| Action Recognition | NTU RGB+D 120 (X-set) | 67.1 | 661 |
| Action Recognition | NTU RGB+D (Cross-View) | 90.4 | 609 |
| Action Recognition | NTU RGB+D 60 (Cross-View) | 85.2 | 575 |
| Action Recognition | NTU RGB+D 60 (X-sub) | 76.3 | 467 |
| Action Recognition | NTU RGB+D X-sub 120 | 67.9 | 377 |
| Action Recognition | NTU RGB-D Cross-Subject 60 | 70.8 | 305 |
| Skeleton-based Action Recognition | NTU 60 (X-sub) | 65.9 | 220 |
| Action Recognition | NTU 120 (Cross-Setup) | 67.9 | 112 |
| Action Recognition | NTU-120 (cross-subject (xsub)) | 67.9 | 82 |
| Action Recognition | PKU-MMD Part I | 80.9 | 53 |