Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition

About

Recognizing unseen skeleton action categories remains highly challenging due to the absence of corresponding skeletal priors. Existing approaches generally follow an "align-then-classify" paradigm but face two fundamental issues: (i) fragile point-to-point alignment arising from imperfect semantics, and (ii) rigid classifiers restricted by static decision boundaries and coarse-grained anchors. To address these issues, we propose a novel method for zero-shot skeleton action recognition, termed Flora, which builds upon FlexibLe neighbOr-aware semantic attunement and an open-form distRibution-aware flow clAssifier. Specifically, we flexibly attune textual semantics by incorporating neighboring inter-class contextual cues to form direction-aware regional semantics, coupled with a cross-modal geometric consistency objective that ensures stable and robust point-to-region alignment. Furthermore, we employ noise-free flow matching to bridge the modality distribution gap between semantic and skeleton latent embeddings, while a condition-free contrastive regularization enhances discriminability, leading to a distribution-aware classifier with fine-grained decision boundaries achieved through token-level velocity predictions. Extensive experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data. Code is available at https://github.com/cseeyangchen/Flora.
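
The noise-free flow matching idea mentioned above can be illustrated with a small sketch. It is not the authors' implementation: it assumes a straight-line path from a semantic embedding (t = 0) to a skeleton embedding (t = 1), works on whole vectors rather than tokens, and uses a hypothetical scoring rule that picks the class whose semantic anchor yields the lowest velocity error. The names `interpolate`, `flow_matching_loss`, `classify`, and the toy labels are illustrative only, and the constant velocity field stands in for a trained network.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
VelocityFn = Callable[[Vector, float], Vector]


def interpolate(source: Vector, target: Vector, t: float) -> List[float]:
    """Point on the straight path from source (t=0) to target (t=1); no noise is added."""
    return [(1.0 - t) * s + t * x for s, x in zip(source, target)]


def target_velocity(source: Vector, target: Vector) -> List[float]:
    """Velocity of the straight path, which is constant in t: target - source."""
    return [x - s for s, x in zip(source, target)]


def flow_matching_loss(velocity_fn: VelocityFn, source: Vector,
                       target: Vector, t: float) -> float:
    """Mean squared error between the predicted and the straight-path velocity."""
    z_t = interpolate(source, target, t)
    predicted = velocity_fn(z_t, t)
    wanted = target_velocity(source, target)
    return sum((p - w) ** 2 for p, w in zip(predicted, wanted)) / len(wanted)


def classify(velocity_fn: VelocityFn, skeleton: Vector,
             class_semantics: Dict[str, Vector],
             times: Sequence[float] = (0.25, 0.5, 0.75)) -> str:
    """Assumed decision rule: choose the class with the lowest average velocity error."""
    scores = {
        label: sum(flow_matching_loss(velocity_fn, semantic, skeleton, t)
                   for t in times) / len(times)
        for label, semantic in class_semantics.items()
    }
    return min(scores, key=scores.get)


# Toy stand-in for a trained model: the modality gap is a constant shift of (1, 0).
def constant_field(z: Vector, t: float) -> List[float]:
    return [1.0, 0.0]


# Hypothetical 2-D semantic anchors for two unseen classes.
semantics = {"wave": [0.0, 0.0], "kick": [0.0, 5.0]}
prediction = classify(constant_field, [1.1, 4.8], semantics)
```

In this toy setting the skeleton feature `[1.1, 4.8]` lies close to the "kick" anchor shifted by the assumed modality gap, so the velocity error is smallest for that class. A learned, token-level velocity network would replace `constant_field` in practice.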

Yang Chen, Miaoge Li, Zhijie Rao, Deze Zeng, Song Guo, Jingcai Guo • 2025

Related benchmarks

Task	Dataset	Result
Action Recognition	NTU RGB+D X-sub 120	Accuracy 65.9	482
Action Recognition	NTU RGB-D Cross-Subject 60	Accuracy 65.3	358
Action Recognition	NTU-60 (xsub)	Accuracy 88.6	271
Action Recognition	NTU-120 (cross-subject (xsub))	Accuracy 71.2	239
Action Recognition	NTU-60 48/12 split	Top-1 Acc 56.1	119
Action Recognition	NTU-120 96/24 split	Top-1 Acc 65.9	100
Action Recognition	NTU RGB+D 120 (110/10 Xsub)	Accuracy 78.9	66
Action Recognition	NTU RGB+D Xsub 60 (Cross-Subject 55/5)	Accuracy 86.3	66
Action Recognition	NTU-RGB+D 60 (48/12)	Accuracy 56.1	49
Action Recognition	PKU-MMD 46/5 I (Xsub)	Accuracy 79.1	43

