SCALE: Semantic- and Confidence-Aware Conditional Variational Autoencoder for Zero-shot Skeleton-based Action Recognition

About

Zero-shot skeleton-based action recognition (ZSAR) aims to recognize action classes without any training skeletons from those classes, relying instead on auxiliary semantics from text. Existing approaches frequently depend on explicit skeleton-text alignment, which can be brittle when action names underspecify fine-grained dynamics and when unseen classes are semantically confusable. We propose SCALE, a lightweight and deterministic Semantic- and Confidence-Aware Listwise Energy-based framework that formulates ZSAR as class-conditional energy ranking. SCALE builds a text-conditioned Conditional Variational Autoencoder in which frozen text representations parameterize both the latent prior and the decoder, enabling likelihood-based evaluation for unseen classes without generating samples at test time. To separate competing hypotheses, we introduce a semantic- and confidence-aware listwise energy loss that emphasizes semantically similar hard negatives and uses posterior uncertainty to adapt decision margins and reweight ambiguous training instances. We also use a latent prototype contrast objective that aligns posterior means with text-derived latent prototypes, improving semantic organization and class separability without direct feature matching. Experiments on the NTU-60 and NTU-120 datasets show that SCALE consistently improves over prior VAE- and alignment-based baselines while remaining competitive with diffusion-based methods.
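
The idea of class-conditional energy ranking can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear encoder, prior, and decoder, the parameter names (`W_mu`, `W_lv`, `P_mu`, `P_lv`, `D_z`, `D_t`), and the choice of energy (squared reconstruction error plus the KL divergence to the text-conditioned prior, evaluated at the posterior mean so that no sampling is needed) are all assumptions made for illustration. The loss shown is a plain softmax cross-entropy over negative energies; the semantic weighting of hard negatives and the confidence-adaptive margins described above are left out.

```python
import numpy as np


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over the latent axis."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0,
        axis=-1,
    )


def class_energies(x, text_emb, params):
    """Return one energy per candidate class for a skeleton feature x.

    x:        (X,) skeleton feature vector
    text_emb: (C, T) frozen text embeddings, one row per class
    params:   dict of hypothetical linear weights (assumed, not from the paper)
    """
    # Posterior q(z | x): assumed linear encoder producing mean and log-variance.
    mu_q = params["W_mu"] @ x          # (Z,)
    logvar_q = params["W_lv"] @ x      # (Z,)

    # Prior p(z | class): assumed linear map from each text embedding.
    mu_p = text_emb @ params["P_mu"].T      # (C, Z)
    logvar_p = text_emb @ params["P_lv"].T  # (C, Z)

    # Decoder conditioned on the latent mean and the text embedding.
    x_hat = (params["D_z"] @ mu_q)[None, :] + text_emb @ params["D_t"].T  # (C, X)

    # Energy = reconstruction error + KL to the class prior (deterministic).
    recon = 0.5 * np.sum((x[None, :] - x_hat) ** 2, axis=-1)               # (C,)
    kl = gaussian_kl(mu_q[None, :], logvar_q[None, :], mu_p, logvar_p)     # (C,)
    return recon + kl


def predict(x, text_emb, params):
    """Zero-shot prediction: the class with the lowest energy wins."""
    return int(np.argmin(class_energies(x, text_emb, params)))


def listwise_loss(energies, target):
    """Softmax cross-entropy over negative energies for one training sample."""
    logits = -energies
    logits = logits - logits.max()  # subtract the max for numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return float(-log_probs[target])
```

Under these assumptions, adding an unseen class at test time only requires appending its text embedding as a new row of `text_emb`; `predict` then ranks it alongside the other candidates without generating any samples.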

Soroush Oraki, Feng Ding, Jie Liang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | NTU-60 48/12 split | Top-1 Acc | 50 | 103 |
| Action Recognition | NTU-120 96/24 split | Top-1 Acc | 54.9 | 84 |
| Action Recognition | NTU-120 110/10 split | Top-1 Acc | 73.6 | 56 |
| Action Recognition | NTU 60 (40-20 seen-unseen) | Top-1 Acc | 35.5 | 18 |
| Action Recognition | NTU-60 | Top-1 Accuracy | 84.5 | 17 |
| Action Recognition | NTU 80/40 120 | Top-1 Accuracy | 32.1 | 7 |
| Action Recognition | NTU-120 [60/60] | Top-1 Accuracy | 23.5 | 7 |