
Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition

About

While remarkable progress has been made on supervised skeleton-based action recognition, the challenge of zero-shot recognition remains relatively unexplored. In this paper, we argue that relying solely on aligning label-level semantics and global skeleton features is insufficient to effectively transfer locally consistent visual knowledge from seen to unseen classes. To address this limitation, we introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales. PURLS introduces a new prompting module and a novel partitioning module to generate aligned textual and visual representations across different levels. The former leverages a pre-trained GPT-3 to infer refined descriptions of the global and local (body-part-based and temporal-interval-based) movements from the original action labels. The latter employs an adaptive sampling strategy to group visual features from all body joint movements that are semantically relevant to a given description. Our approach is evaluated on various skeleton/language backbones and three large-scale datasets, i.e., NTU-RGB+D 60, NTU-RGB+D 120, and a newly curated dataset, Kinetics-skeleton 200. The results demonstrate the universality and superior performance of PURLS, which surpasses prior skeleton-based solutions and standard baselines from other domains. The source code is available at https://github.com/azzh1/PURLS.
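
To make the idea of description-guided feature grouping more concrete, here is a minimal sketch of zero-shot classification in which per-joint skeleton features are pooled according to their similarity to a text description, and the pooled feature is then matched against that description. This is an illustration only, not the PURLS implementation: the softmax-weighted pooling is a simple stand-in for the adaptive sampling strategy described above, the embeddings are toy vectors rather than outputs of a real skeleton or language backbone, and all function names (`cosine`, `softmax`, `pool_by_relevance`, `zero_shot_predict`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_by_relevance(joint_feats, text_emb):
    """Sum joint features, weighted by softmax of their similarity to text_emb.

    Assumption: this weighting stands in for the paper's adaptive sampling.
    """
    weights = softmax([cosine(f, text_emb) for f in joint_feats])
    dim = len(text_emb)
    return [sum(w * f[d] for w, f in zip(weights, joint_feats)) for d in range(dim)]


def zero_shot_predict(joint_feats, class_text_embs):
    """Return the best-matching label and the per-class similarity scores."""
    scores = {
        label: cosine(pool_by_relevance(joint_feats, emb), emb)
        for label, emb in class_text_embs.items()
    }
    return max(scores, key=scores.get), scores


# Toy example: three joint features and two unseen-class description embeddings.
joint_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
class_text_embs = {"wave": [1.0, 0.0], "kick": [0.0, 1.0]}
label, scores = zero_shot_predict(joint_feats, class_text_embs)
print(label, scores)
```

In this toy setup, two of the three joint features point toward the "wave" embedding, so that class receives the higher score. A real system would obtain `joint_feats` from a skeleton encoder and `class_text_embs` from a language encoder applied to the generated descriptions.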

Anqi Zhu, Qiuhong Ke, Mingming Gong, James Bailey • 2024

Related benchmarks

Task | Dataset | Result | Rank
Action Recognition | NTU RGB+D 60 (X-sub) | Accuracy 79.23 | 467
Skeleton-based Action Recognition | NTU RGB+D 120 Cross-Subject | Top-1 Accuracy 72 | 143
Action Recognition | NTU RGB+D 120 (Cross-View) | Accuracy 71.95 | 47
Action Recognition | NTU 60 (55/5 split) | Top-1 Acc 79.23 | 35
Action Recognition | NTU-120 110/10 split | Top-1 Acc 71.95 | 34
Skeleton Action Recognition | NTU RGB+D Cross-Subject (Xsub) 120 | Accuracy 52 | 29
Action Recognition | NTU-60 48/12 split | Top-1 Acc 40.99 | 27
Action Recognition | NTU-120 96/24 split | Top-1 Acc 52.01 | 18
Zero-shot Action Recognition | NTU-RGB+D 120 (96/24) | Top-1 Acc 52.01 | 16
Zero-shot Action Recognition | NTU RGB+D 120 (110/10 Split) | Top-1 Accuracy 71.95 | 16
Showing 10 of 42 rows
