Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation
About
We introduce Skeleton-Cache, the first training-free test-time adaptation framework for skeleton-based zero-shot action recognition (SZAR), aimed at improving model generalization to unseen actions during inference. Skeleton-Cache reformulates inference as a lightweight retrieval process over a non-parametric cache that stores structured skeleton representations, combining both global and fine-grained local descriptors. To guide the fusion of descriptor-wise predictions, we leverage the semantic reasoning capabilities of large language models (LLMs) to assign class-specific importance weights. By integrating these structured descriptors with LLM-guided semantic priors, Skeleton-Cache dynamically adapts to unseen actions without any additional training or access to training data. Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. The code is publicly available at https://github.com/Alchemist0754/Skeleton-Cache.
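The retrieval-and-fusion idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `DescriptorCache`, the function `fuse`, the exponential affinity with sharpness `beta`, the mixing factor `alpha`, and the hand-set weight matrix standing in for LLM-assigned class-specific weights are all hypothetical. The official code is in the repository linked above.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x) + eps)


class DescriptorCache:
    """Hypothetical non-parametric cache with one key list per descriptor.

    Descriptor 0 could be a global skeleton feature and the remaining ones
    local (e.g. body-part) features; this layout is an assumption.
    """

    def __init__(self, num_descriptors, num_classes):
        self.num_classes = num_classes
        self.keys = [[] for _ in range(num_descriptors)]
        self.labels = []  # pseudo-labels predicted at test time

    def add(self, descriptors, pseudo_label):
        """Store one test sample: descriptors has shape (D, dim)."""
        for d, feat in enumerate(descriptors):
            self.keys[d].append(l2_normalize(feat))
        self.labels.append(int(pseudo_label))

    def descriptor_logits(self, descriptors, beta=5.0):
        """Return a (D, C) array of cache-based class scores per descriptor."""
        out = np.zeros((len(self.keys), self.num_classes))
        if not self.labels:
            return out  # empty cache contributes nothing
        one_hot = np.eye(self.num_classes)[self.labels]  # (N, C)
        for d, keys in enumerate(self.keys):
            sim = np.stack(keys) @ l2_normalize(descriptors[d])  # (N,)
            affinity = np.exp(-beta * (1.0 - sim))  # assumed affinity form
            out[d] = affinity @ one_hot
        return out


def fuse(zero_shot_logits, desc_logits, class_weights, alpha=1.0):
    """Add descriptor-wise cache scores, weighted per class, to backbone logits.

    class_weights has shape (D, C); in the paper these weights come from an
    LLM, here they are simply passed in.
    """
    return zero_shot_logits + alpha * (class_weights * desc_logits).sum(axis=0)
```

A usage example: after caching one sample with pseudo-label 2, a query with the same descriptors is pulled toward class 2 even though the backbone alone would have preferred class 0.

```python
cache = DescriptorCache(num_descriptors=2, num_classes=3)
sample = np.array([[1.0, 0.0], [0.0, 1.0]])
cache.add(sample, pseudo_label=2)

backbone_logits = np.array([0.5, 0.4, 0.45])
weights = np.full((2, 3), 0.5)  # placeholder for LLM-assigned weights
fused = fuse(backbone_logits, cache.descriptor_logits(sample), weights)
predicted_class = int(np.argmax(fused))
```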
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Zero-shot Action Recognition | NTU RGB+D 60 (55/5 Split) | Top-1 Accuracy | 89.41 | 16 |
| Zero-shot Action Recognition | NTU RGB+D 60 (48/12 Split) | Top-1 Acc | 52.03 | 16 |
| Zero-shot Action Recognition | NTU RGB+D 120 (110/10 Split) | Top-1 Accuracy | 77.6 | 16 |
| Zero-shot Action Recognition | NTU-RGB+D 120 (96/24) | Top-1 Acc | 56.83 | 16 |
| Skeleton Action Recognition | NTU RGB+D 120 (Cross-Setup (Xset), 110/10 Split) | S Score | 62.19 | 13 |
| Skeleton-based Action Recognition | NTU RGB+D 60 (55/5 Split) | ZSL Accuracy | 89.41 | 11 |
| Skeleton-based Action Recognition | NTU RGB+D 60 (48/12 Split) | ZSL | 47.83 | 11 |
| Skeleton-based Action Recognition | NTU RGB+D 120 (96/24 Split) | ZSL Accuracy | 56.83 | 11 |
| Skeleton-based Action Recognition | NTU 60 (random-split) | ZSL Accuracy | 89.86 | 9 |
| Skeleton-based Action Recognition | NTU 120 (random-split) | ZSL Accuracy | 56.18 | 9 |