3D Human Action Representation Learning via Cross-View Consistency Pursuit
About
In this work, we propose a Cross-view Contrastive Learning framework for unsupervised 3D skeleton-based action Representation (CrosSCLR) that leverages multi-view complementary supervision signals. CrosSCLR consists of a single-view contrastive learning module (SkeletonCLR) and a cross-view consistent knowledge mining module (CVC-KM), integrated in a collaborative learning manner. CVC-KM exchanges high-confidence positive/negative samples and their distributions among views according to their embedding similarity, which ensures cross-view consistency in terms of contrastive context, i.e., similar distributions. Extensive experiments show that CrosSCLR achieves remarkable action recognition results on the NTU-60 and NTU-120 datasets under unsupervised settings and learns higher-quality action representations. Our code is available at https://github.com/LinguoLi/CrosSCLR.
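The interplay between single-view contrastive learning and cross-view mining described above can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that). It is a simplified illustration under stated assumptions: two views named `u` and `v`, memory banks whose rows are index-aligned across views, an InfoNCE-style loss, top-k nearest neighbours as the "high-confidence positives", and a temperature of 0.07. All function names and hyperparameters are hypothetical, and only the positive-mining part is shown, not the exchange of similarity distributions.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax_logits(query, key, bank, temperature):
    """Log-softmax over [own key, memory-bank entries] for each query."""
    pos = np.sum(query * key, axis=1, keepdims=True)      # (B, 1)
    neg = query @ bank.T                                  # (B, M)
    logits = np.concatenate([pos, neg], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

def single_view_loss(query, key, bank, temperature=0.07):
    """InfoNCE-style loss: the key is the only positive, the bank holds negatives."""
    return -log_softmax_logits(query, key, bank, temperature)[:, 0].mean()

def mine_topk(query, bank, k):
    """Indices of the k bank entries most similar to each query (assumed positives)."""
    sim = query @ bank.T
    return np.argsort(-sim, axis=1)[:, :k]

def cross_view_loss(query_v, key_v, bank_v, mined_u, temperature=0.07):
    """Loss for view v where bank entries mined in view u also count as positives.

    Assumes row i of both views' banks refers to the same sample.
    """
    log_prob = log_softmax_logits(query_v, key_v, bank_v, temperature)
    mask = np.zeros(log_prob.shape, dtype=bool)
    mask[:, 0] = True                                     # own key is always positive
    rows = np.arange(len(query_v))[:, None]
    mask[rows, mined_u + 1] = True                        # +1 skips the key column
    pos_mass = np.where(mask, np.exp(log_prob), 0.0).sum(axis=1)
    return -np.log(pos_mass).mean()

# Toy data: random unit-norm embeddings stand in for encoder outputs.
rng = np.random.default_rng(0)
B, M, D = 4, 16, 8
shapes = [(B, D), (B, D), (M, D)]
q_u, k_u, bank_u = (l2_normalize(rng.normal(size=s)) for s in shapes)
q_v, k_v, bank_v = (l2_normalize(rng.normal(size=s)) for s in shapes)

mined_u = mine_topk(q_u, bank_u, k=3)                     # neighbours found in view u
loss_single = single_view_loss(q_v, k_v, bank_v)
loss_cross = cross_view_loss(q_v, k_v, bank_v, mined_u)   # view v trained with them
```

In this sketch the mined neighbours only enlarge the set of positives, so for the same inputs the cross-view loss is never larger than the single-view loss; with `k=0` the two coincide.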
Related benchmarks
| Task | Dataset | Accuracy (%) | Rank |
|---|---|---|---|
| Action Recognition | NTU RGB+D 120 (X-set) | 80.4 | 717 |
| Action Recognition | NTU RGB+D (Cross-View) | 92.5 | 652 |
| Action Recognition | NTU RGB+D 60 (Cross-View) | 92.5 | 588 |
| Action Recognition | NTU RGB+D (Cross-subject) | 86.2 | 500 |
| Action Recognition | NTU RGB+D 60 (X-sub) | 86.2 | 467 |
| Action Recognition | NTU RGB+D X-sub 120 | 80.5 | 430 |
| Action Recognition | NTU RGB-D Cross-Subject 60 | 86.2 | 336 |
| Action Recognition | NTU-60 (xsub) | 86.2 | 223 |
| Action Recognition | NTU RGB+D 120 Cross-Subject | 80.5 | 222 |
| Skeleton-based Action Recognition | NTU 60 (X-sub) | 86.2 | 220 |