Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning
About
Despite the success of fully-supervised human skeleton sequence modeling, self-supervised pre-training for skeleton sequence representation learning has been an active field because acquiring task-specific skeleton annotations at large scale is difficult. Recent studies focus on learning video-level temporal and discriminative information using contrastive learning, but overlook the hierarchical spatial-temporal nature of human skeletons. In contrast to such superficial supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS) to explicitly capture spatial, short-term temporal, and long-term temporal dependencies at the frame, clip, and video levels, respectively. To evaluate the proposed self-supervised pre-training scheme with Hi-TRS, we conduct extensive experiments covering three skeleton-based downstream tasks: action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance. Additionally, we demonstrate that the prior knowledge learned by our model in the pre-training stage transfers well across different downstream tasks.
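
The frame, clip, and video levels named in the abstract can be illustrated with a small sketch of how a skeleton sequence might be grouped before encoding. This is not the authors' implementation: the helper names (`split_into_clips`, `hierarchy_token_counts`), the clip length, and the stride are hypothetical choices made for illustration, and the abstract does not specify how clips are formed. The sketch only counts how many tokens each level would attend over: joints within a frame (spatial), frames within a clip (short-term temporal), and clips within a video (long-term temporal).

```python
from typing import Dict, List, Tuple

Joint = Tuple[float, float, float]  # (x, y, z) coordinates of one joint
Frame = List[Joint]                 # all joints of one skeleton at one time step


def split_into_clips(frames: List[Frame], clip_len: int, stride: int) -> List[List[Frame]]:
    """Group consecutive frames into fixed-length clips (hypothetical scheme).

    Clips overlap when stride < clip_len. Trailing frames that do not fill
    a whole clip are dropped.
    """
    if clip_len <= 0 or stride <= 0:
        raise ValueError("clip_len and stride must be positive")
    last_start = len(frames) - clip_len
    return [frames[s:s + clip_len] for s in range(0, last_start + 1, stride)]


def hierarchy_token_counts(num_frames: int, num_joints: int,
                           clip_len: int, stride: int) -> Dict[str, int]:
    """Return how many tokens each level of the hierarchy would see."""
    # Dummy skeleton sequence: every joint sits at the origin.
    frames = [[(0.0, 0.0, 0.0)] * num_joints for _ in range(num_frames)]
    clips = split_into_clips(frames, clip_len, stride)
    return {
        "frame_level": num_joints,   # spatial: joints within one frame
        "clip_level": clip_len,      # short-term: frames within one clip
        "video_level": len(clips),   # long-term: clips within the video
    }


# 16 frames of a 25-joint skeleton (the NTU RGB+D joint count),
# grouped into non-overlapping clips of 4 frames (assumed values).
counts = hierarchy_token_counts(num_frames=16, num_joints=25, clip_len=4, stride=4)
print(counts)
```

With a stride smaller than the clip length the clips overlap, which increases the number of video-level tokens without changing the frame-level or clip-level counts.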
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | NTU RGB+D 60 (X-sub) | Accuracy | 90 | 467 |
| Skeleton-based Action Recognition | NTU 60 (X-sub) | Accuracy | 77.7 | 220 |
| Action Recognition | NTU RGB+D X-View 60 | Accuracy | 95.7 | 172 |
| Skeleton-based Action Recognition | NTU 120 (X-sub) | Accuracy | 85.3 | 139 |
| 3D Action Recognition | NTU RGB+D 60 (Cross-View) | Accuracy | 81.1 | 29 |
| Action Recognition | NTU 60 (X-view) | Top-1 Acc (1% labels) | 42.9 | 22 |
| 3D Action Recognition | NTU-120 (X-set) | Top-1 Acc | 87.4 | 16 |
| Action Localization | PKUMMD (test) | mAP@0.5 | 66.6 | 13 |
| Action Recognition | NTU 60 (X-sub) | Top-1 Acc (5% labels) | 63.3 | 11 |
| Action Recognition | NTU cross-subject 60 | Acc (1% labels) | 39.1 | 11 |