Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding
About
Unsupervised pre-training has recently shown great success in skeleton-based action understanding. Existing works typically train separate modality-specific models, then integrate the multi-modal information for action understanding through a late-fusion strategy. Although these approaches achieve strong performance, they rely on complex yet redundant multi-stream model designs, and each stream is limited to a fixed input skeleton modality. To alleviate these issues, we propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL, which exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner. Specifically, instead of designing separate modality-specific optimization processes for uni-modal unsupervised learning, we feed different modality inputs into the same stream with an early-fusion strategy to learn their multi-modal features, which reduces model complexity. To ensure that the fused multi-modal features do not exhibit modality bias, i.e., are not dominated by a certain modality input, we further propose both intra- and inter-modal consistency learning to guarantee that the multi-modal features contain the complete semantics of each modality via feature decomposition and distinct alignment. In this manner, our framework learns unified representations of uni-modal or multi-modal skeleton input, making it flexible to different kinds of modality input for robust action understanding in practical cases. Extensive experiments on three large-scale datasets, i.e., NTU-60, NTU-120, and PKU-MMD II, demonstrate that UmURL is highly efficient, with complexity comparable to that of uni-modal methods, while achieving new state-of-the-art performance across various downstream task scenarios in skeleton-based action representation learning.
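The single-stream early-fusion idea can be illustrated with a minimal NumPy sketch. It is not the UmURL implementation. It assumes three commonly used skeleton modalities (joint, bone, motion), which the abstract does not name, and it uses hypothetical names throughout (`make_modalities`, `EarlyFusionEncoder`, a toy chain-shaped `parents` array, random untrained weights). Fusion by averaging per-modality embeddings is likewise an assumption chosen for simplicity.

```python
import numpy as np


def make_modalities(joints, parents):
    """Derive bone and motion views from joint coordinates.

    joints:  array of shape (T, J, C): frames, joints, coordinates.
    parents: int array of shape (J,), parent index of each joint
             (assumed skeleton topology; the root points to itself).
    """
    bone = joints - joints[:, parents, :]    # vector from parent to joint
    motion = np.zeros_like(joints)
    motion[1:] = joints[1:] - joints[:-1]    # frame-to-frame displacement
    return {"joint": joints, "bone": bone, "motion": motion}


class EarlyFusionEncoder:
    """Toy single-stream encoder: per-modality embedding, fuse, shared layer."""

    def __init__(self, in_dim, hidden_dim, modalities, rng):
        # One lightweight embedding per modality (random, untrained weights).
        self.embed = {
            m: rng.standard_normal((in_dim, hidden_dim)) * 0.1 for m in modalities
        }
        # A single shared stream processes the fused features.
        self.shared = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1

    def __call__(self, inputs):
        # inputs: dict mapping modality name -> array of shape (T, in_dim).
        tokens = [inputs[m] @ self.embed[m] for m in inputs]
        fused = np.mean(tokens, axis=0)          # early fusion, shape (T, hidden_dim)
        hidden = np.tanh(fused @ self.shared)    # shared single-stream layer
        return hidden.mean(axis=0)               # pool over time, shape (hidden_dim,)


rng = np.random.default_rng(0)
T, J, C, H = 16, 25, 3, 32
joints = rng.standard_normal((T, J, C))
parents = np.maximum(np.arange(J) - 1, 0)        # hypothetical chain skeleton

views = {k: v.reshape(T, -1) for k, v in make_modalities(joints, parents).items()}
encoder = EarlyFusionEncoder(J * C, H, list(views), rng)

z_multi = encoder(views)                         # all three modalities
z_uni = encoder({"joint": views["joint"]})       # a single modality
print(z_multi.shape, z_uni.shape)
```

Because fusion happens before the shared layer, the same encoder accepts any subset of modalities and returns a representation of the same size, which mirrors the flexibility to uni-modal or multi-modal input described above. The intra- and inter-modal consistency objectives are not sketched here, since the abstract does not specify them in enough detail.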
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Action Recognition | NTU RGB+D 120 (X-set) | Accuracy 77.2 | 661 |
| Action Recognition | NTU RGB+D 60 (Cross-View) | Accuracy 90.9 | 575 |
| Action Recognition | NTU RGB+D 60 (X-sub) | Accuracy 84.2 | 467 |
| Action Recognition | NTU RGB+D X-sub 120 | Accuracy 75.2 | 377 |
| Skeleton-based Action Recognition | NTU 60 (X-sub) | Accuracy 84.4 | 220 |
| Skeleton-based Action Recognition | NTU RGB+D 120 (X-set) | Top-1 Accuracy 77.2 | 184 |
| Action Recognition | NTU RGB+D X-View 60 | Accuracy 91.4 | 172 |
| Skeleton-based Action Recognition | NTU 120 (X-sub) | -- | 139 |
| Skeleton-based Action Recognition | NTU RGB+D 60 (X-View) | Top-1 Accuracy 91.4 | 126 |
| Action Recognition | NTU-120 (cross-subject (xsub)) | Accuracy 75.8 | 82 |