Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning

About

Despite the success of fully-supervised human skeleton sequence modeling, self-supervised pre-training for skeleton sequence representation learning has been an active field because acquiring task-specific skeleton annotations at scale is difficult. Recent studies focus on learning video-level temporal and discriminative information using contrastive learning, but overlook the hierarchical spatial-temporal nature of human skeletons. In contrast to such superficial supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS) to explicitly capture spatial, short-term temporal, and long-term temporal dependencies at the frame, clip, and video levels, respectively. To evaluate the proposed self-supervised pre-training scheme with Hi-TRS, we conduct extensive experiments covering three skeleton-based downstream tasks: action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance. Additionally, we demonstrate that the prior knowledge learned by our model in the pre-training stage has strong transfer capability for different downstream tasks.
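
The frame, clip, and video levels named in the abstract can be pictured with a small toy sketch. It is not the Hi-TRS encoder: mean pooling stands in for the Transformer at each level, and the function names, the non-overlapping clip split, and the clip length of 4 are all assumptions made only for illustration.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average equally sized vectors element-wise (stand-in for an encoder)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def split_into_clips(items: Sequence[Vector], clip_len: int) -> List[List[Vector]]:
    """Split a sequence into non-overlapping clips (an assumed clip scheme)."""
    last_start = len(items) - clip_len
    return [list(items[i:i + clip_len]) for i in range(0, last_start + 1, clip_len)]


def hierarchical_summary(
    video: Sequence[Sequence[Sequence[float]]], clip_len: int
) -> Tuple[List[Vector], List[Vector], Vector]:
    """Summarize a skeleton sequence shaped [frames][joints][coords] at three levels."""
    # Frame level: combine the joints inside each frame (spatial).
    frame_feats = [mean_pool(frame) for frame in video]
    # Clip level: combine neighbouring frames (short-term temporal).
    clip_feats = [mean_pool(clip) for clip in split_into_clips(frame_feats, clip_len)]
    # Video level: combine all clips (long-term temporal).
    video_feat = mean_pool(clip_feats)
    return frame_feats, clip_feats, video_feat


# Toy input: 8 frames, 3 joints, 3D coordinates; x equals the frame index.
toy_video = [[[float(t), 0.0, 0.0] for _ in range(3)] for t in range(8)]
frame_feats, clip_feats, video_feat = hierarchical_summary(toy_video, clip_len=4)
print(len(frame_feats), len(clip_feats), video_feat)
```

In a learned model each pooling step would be replaced by a trained encoder with its own pre-training objective; the sketch only shows how one sequence yields features at three granularities.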

Yuxiao Chen, Long Zhao, Jianbo Yuan, Yu Tian, Zhaoyang Xia, Shijie Geng, Ligong Han, Dimitris N. Metaxas • 2022

Related benchmarks

Task                               Dataset                    Result                      Rank
Action Recognition                 NTU RGB+D 60 (X-sub)       Accuracy 90                 467
Skeleton-based Action Recognition  NTU 60 (X-sub)             Accuracy 77.7               220
Action Recognition                 NTU RGB+D X-View 60        Accuracy 95.7               172
Skeleton-based Action Recognition  NTU 120 (X-sub)            Accuracy 85.3               139
3D Action Recognition              NTU RGB+D 60 (Cross-View)  Accuracy 81.1               29
Action Recognition                 NTU 60 (X-view)            Top-1 Acc (1% labels) 42.9  22
3D Action Recognition              NTU-120 (X-set)            Top-1 Acc 87.4              16
Action Localization                PKUMMD (test)              mAP@0.5 66.6                13
Action Recognition                 NTU 60 (X-sub)             Top-1 Acc (5% Labels) 63.3  11
Action Recognition                 NTU cross-subject 60       Acc (1% Labels) 39.1        11
