CMD: Self-supervised 3D Action Representation Learning with Cross-modal Mutual Distillation

About

In 3D action recognition, skeleton modalities carry rich complementary information. Nevertheless, how to model and utilize this information remains a challenging problem for self-supervised 3D action representation learning. In this work, we formulate cross-modal interaction as a bidirectional knowledge distillation problem. Unlike classic distillation solutions, which transfer the knowledge of a fixed, pre-trained teacher to a student, our approach continuously updates the knowledge and distills it bidirectionally between modalities. To this end, we propose a new Cross-modal Mutual Distillation (CMD) framework with the following designs. On the one hand, the neighboring similarity distribution is introduced to model the knowledge learned in each modality; this relational information is naturally suited to contrastive frameworks. On the other hand, asymmetrical configurations are used for the teacher and the student to stabilize the distillation process and to transfer high-confidence information between modalities. By derivation, we find that the cross-modal positive mining used in previous works can be regarded as a degenerate case of our CMD. We perform extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets. Our approach outperforms existing self-supervised methods and sets a series of new records. The code is available at: https://github.com/maoyunyao/CMD
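
The two designs named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the repository linked above for that); it only shows one plausible reading of the idea. The function names, the use of cosine similarity, the anchor sets, and the choice of a lower teacher temperature as the "asymmetrical configuration" are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def neighbor_distribution(embedding, anchors, temperature):
    """Softmax over cosine similarities between one embedding and K anchors.

    embedding: shape (D,), anchors: shape (K, D). Returns shape (K,).
    """
    logits = l2_normalize(anchors) @ l2_normalize(embedding) / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of the same length."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def mutual_distillation_loss(z_a, z_b, anchors_a, anchors_b,
                             tau_teacher=0.05, tau_student=0.1):
    """Bidirectional distillation between two modalities a and b.

    Assumption: the teacher uses a lower temperature than the student,
    so its neighbor distribution is sharper (higher confidence).
    """
    # Modality a acts as teacher, modality b as student.
    teacher_a = neighbor_distribution(z_a, anchors_a, tau_teacher)
    student_b = neighbor_distribution(z_b, anchors_b, tau_student)
    # Modality b acts as teacher, modality a as student.
    teacher_b = neighbor_distribution(z_b, anchors_b, tau_teacher)
    student_a = neighbor_distribution(z_a, anchors_a, tau_student)
    return kl_divergence(teacher_a, student_b) + kl_divergence(teacher_b, student_a)
```

Here `z_a` and `z_b` would be the embeddings of the same skeleton sequence in two modalities (for example joint and motion), and `anchors_a` and `anchors_b` the embeddings of the same K reference samples in each modality. In a real training loop the teacher distributions would typically be treated as constants (no gradient), which this NumPy sketch does not model. As `tau_teacher` shrinks toward zero, the teacher distribution approaches a one-hot vector over the nearest anchor, which may be one way to read the abstract's remark about positive mining.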

Yunyao Mao, Wengang Zhou, Zhenbo Lu, Jiajun Deng, Houqiang Li • 2022

Related benchmarks

Task                               Dataset                      Result               Rank
Action Recognition                 NTU RGB+D 120 (X-set)        Accuracy 76.1        661
Action Recognition                 NTU RGB+D 60 (Cross-View)    Accuracy 90.9        575
Action Recognition                 NTU RGB+D 60 (X-sub)         Accuracy 84.1        467
Action Recognition                 NTU RGB+D X-sub 120          Accuracy 74.7        377
Skeleton-based Action Recognition  NTU 60 (X-sub)               Accuracy 84.1        220
Skeleton-based Action Recognition  NTU RGB+D 120 (X-set)        Top-1 Accuracy 76.1  184
Action Recognition                 NTU RGB+D 120 Cross-Subject  Accuracy 69.1        183
Action Recognition                 NTU RGB+D X-View 60          Accuracy 90.9        172
Skeleton-based Action Recognition  NTU 120 (X-sub)              --                   139
Skeleton-based Action Recognition  NTU RGB+D 60 (X-View)        Top-1 Accuracy 90.9  126
Showing 10 of 34 rows
