Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning

About

Skeleton-based action representation learning aims to interpret and understand human behaviors by encoding skeleton sequences. Existing approaches fall into two primary training paradigms: supervised learning and self-supervised learning. However, the former relies on one-hot classification, which requires labor-intensive annotation of predefined action categories, while the latter involves skeleton transformations (e.g., cropping) in the pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C$^2$VL) based on Cross-modal Contrastive learning that uses progressive distillation to learn task-agnostic human skeleton action representations from Vision-Language knowledge prompts. Specifically, we establish the vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which enrich the fine-grained details that the skeleton action space lacks. Moreover, we propose intra-modal self-similarity and inter-modal cross-consistency softened targets in the cross-modal representation learning process to progressively control and guide how closely vision-language knowledge prompts and their corresponding skeletons are pulled together. These soft instance discrimination and self-knowledge distillation strategies contribute to learning better skeleton-based action representations from noisy skeleton-vision-language pairs. During the inference phase, our method requires only skeleton data as input for action recognition and no longer needs vision-language prompts. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our method outperforms previous methods and achieves state-of-the-art results. Code is available at: https://github.com/cseeyangchen/C2VL.
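
The softened-target idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (their repository is the reference). It assumes one plausible form of softening: the one-hot instance-matching target is blended with a softmax over intra-modal self-similarity, and the blend weight `alpha` would be raised gradually during training. The names `soft_targets`, `soft_cross_modal_loss`, `alpha`, and `temperature` are hypothetical and chosen only for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(z):
    """Numerically stable row-wise log-softmax."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def soft_targets(emb, alpha, temperature):
    """Blend one-hot targets with intra-modal self-similarity (assumed form).

    alpha = 0 gives plain one-hot instance discrimination; a larger alpha
    moves more target mass onto samples that look similar within one modality.
    """
    emb = l2_normalize(emb)
    self_sim = np.exp(log_softmax(emb @ emb.T / temperature))
    return (1.0 - alpha) * np.eye(emb.shape[0]) + alpha * self_sim


def soft_cross_modal_loss(skeleton_emb, prompt_emb, alpha=0.3, temperature=0.1):
    """Symmetric cross-entropy between cross-modal logits and soft targets."""
    s = l2_normalize(skeleton_emb)
    p = l2_normalize(prompt_emb)
    logits = s @ p.T / temperature  # (batch, batch) skeleton-to-prompt scores

    # Targets for each direction come from the self-similarity of the modality
    # being retrieved. In a real training loop they would be treated as
    # constants (no gradient flows through them).
    t_s2p = soft_targets(p, alpha, temperature)
    t_p2s = soft_targets(s, alpha, temperature)

    loss_s2p = -(t_s2p * log_softmax(logits)).sum(axis=1).mean()
    loss_p2s = -(t_p2s * log_softmax(logits.T)).sum(axis=1).mean()
    return 0.5 * (loss_s2p + loss_p2s)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    skeletons = rng.normal(size=(8, 16))  # stand-in skeleton embeddings
    prompts = skeletons + 0.1 * rng.normal(size=(8, 16))  # noisy paired prompts
    print(soft_cross_modal_loss(skeletons, prompts, alpha=0.3))
```

With `alpha=0` the function reduces to a standard symmetric contrastive (InfoNCE-style) loss. A progressive schedule, under the same assumptions, would simply pass a larger `alpha` as training proceeds.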

Yang Chen, Tian He, Junfeng Fu, Ling Wang, Jingcai Guo, Ting Hu, Hong Cheng • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Action Recognition | NTU RGB+D 120 (X-set) | Accuracy | 54.7 | 770 |
| Action Recognition | NTU RGB+D 60 (Cross-View) | Accuracy | 76.6 | 601 |
| Action Recognition | NTU RGB-D Cross-Subject 60 | Accuracy | 69.4 | 358 |
| Action Recognition | NTU RGB+D 120 Cross-Subject | Accuracy | 55.7 | 241 |
| Driver distraction detection | Drive&Act IR | Average Balanced Accuracy | 57.16 | 9 |
| Driver distraction detection | Drive&Act Depth | Average Balanced Accuracy | 51.71 | 9 |
| Driver distraction detection | Drive&Act Skeleton | Average Balanced Accuracy | 35.7 | 9 |
| Driver distraction detection | Drive&Act Driver IR view | Average Balanced Accuracy | 41.19 | 6 |
| Driver distraction detection | Drive&Act Co Driver IR view | Average Balanced Accuracy | 48.38 | 6 |
| Driver distraction detection | Drive&Act Kinect IR view | Average Balanced Accuracy | 57.16 | 6 |
Showing 10 of 15 rows
