Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition
About
Recognizing unseen skeleton action categories remains highly challenging due to the absence of corresponding skeletal priors. Existing approaches generally follow an "align-then-classify" paradigm but face two fundamental issues: (i) fragile point-to-point alignment arising from imperfect semantics, and (ii) rigid classifiers restricted by static decision boundaries and coarse-grained anchors. To address these issues, we propose a novel method for zero-shot skeleton action recognition, termed **Flora**, which builds upon **F**lexib**L**e neighb**O**r-aware semantic attunement and open-form dist**R**ibution-aware flow cl**A**ssifier. Specifically, we flexibly attune textual semantics by incorporating neighboring inter-class contextual cues to form direction-aware regional semantics, coupled with a cross-modal geometric consistency objective that ensures stable and robust point-to-region alignment. Furthermore, we employ noise-free flow matching to bridge the modality distribution gap between semantic and skeleton latent embeddings, while a condition-free contrastive regularization enhances discriminability, leading to a distribution-aware classifier with fine-grained decision boundaries achieved through token-level velocity predictions. Extensive experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data. Code is available at https://github.com/cseeyangchen/Flora.
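To make the flow-based classification idea more concrete, here is a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: "noise-free" is read as starting the flow from a semantic embedding rather than from Gaussian noise, the path between the two embeddings is a straight line, and the class is chosen by the lowest velocity-prediction error. The names `interpolate`, `flow_matching_loss`, `classify`, and `toy_velocity` are hypothetical, the 2-D embeddings are made up for illustration, and token-level prediction and the contrastive regularizer are not modeled.

```python
def interpolate(src, dst, t):
    """Point on the straight path from src (t=0) to dst (t=1)."""
    return [(1.0 - t) * s + t * d for s, d in zip(src, dst)]


def flow_matching_loss(velocity_fn, src, dst, t):
    """Mean squared error between the predicted velocity at x_t and the
    constant straight-path velocity (dst - src)."""
    x_t = interpolate(src, dst, t)
    pred = velocity_fn(x_t, t)
    target = [d - s for s, d in zip(src, dst)]
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def classify(velocity_fn, skeleton_feat, class_semantics, t=0.5):
    """Assumed decision rule: pick the class whose semantic embedding,
    paired with the skeleton feature, yields the lowest velocity error."""
    losses = {
        name: flow_matching_loss(velocity_fn, sem, skeleton_feat, t)
        for name, sem in class_semantics.items()
    }
    return min(losses, key=losses.get), losses


def toy_velocity(x_t, t):
    """Hypothetical stand-in for a trained velocity network: a constant
    field that shifts every semantic embedding by (1, 1)."""
    return [1.0, 1.0]


# Made-up 2-D semantic embeddings for two unseen classes.
class_semantics = {"wave": [0.0, 0.0], "kick": [4.0, 0.0]}

# A skeleton feature lying close to "wave" shifted by (1, 1).
label, losses = classify(toy_velocity, [1.1, 0.9], class_semantics)
print(label)  # prints "wave"
```

In this toy setup the decision depends on how well the velocity field explains the transport from each candidate semantic embedding to the observed skeleton feature, rather than on a fixed distance to a single class anchor.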
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Action Recognition | NTU RGB+D X-sub 120 | Accuracy | 65.9 | 430 |
| Action Recognition | NTU RGB-D Cross-Subject 60 | Accuracy | 65.3 | 336 |
| Action Recognition | NTU-60 (xsub) | Accuracy | 88.6 | 223 |
| Action Recognition | NTU-120 (cross-subject (xsub)) | Accuracy | 71.2 | 211 |
| Action Recognition | NTU-60 48/12 split | Top-1 Acc | 56.1 | 103 |
| Action Recognition | NTU-120 96/24 split | Top-1 Acc | 65.9 | 84 |
| Action Recognition | NTU RGB+D 120 (110/10 Xsub) | Accuracy | 78.9 | 66 |
| Action Recognition | NTU-RGB+D 60 (48/12) | Accuracy | 56.1 | 49 |
| Action Recognition | PKU-MMD 46/5 I (Xsub) | Accuracy | 79.1 | 43 |
| Action Recognition | NTU RGB+D Xsub 60 (Cross-Subject 55/5) | Accuracy | 86.3 | 40 |