PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition

About

Zero-shot skeleton-based action recognition (ZSSAR) is typically treated as a skeleton-text alignment problem: encode joint-coordinate sequences, align them with language, and classify unseen actions. We argue that this alignment is often too late. Skeletons are not complete action observations, but compressed outputs of human pose estimation (HPE); by the time alignment begins, human-object interactions and pose-relative visual cues may no longer be explicit. We call this upstream semantic loss. To address it, we propose PoseBridge, an HPE-aware ZSSAR framework that bridges intermediate HPE representations to skeleton-text alignment. Rather than adding an RGB action branch or object detector, PoseBridge extracts pose-anchored semantic cues from the same HPE process that produces skeletons, then transfers them through skeleton-conditioned bridging and semantic prototype adaptation. Across NTU-RGB+D 60/120, PKU-MMD, and Kinetics-200/400, PoseBridge improves ZSSAR performance under the evaluated protocols. On the Kinetics-200/400 PURLS benchmark, which contains in-the-wild videos with diverse scenes and action contexts, PoseBridge shows the clearest separation, improving the strongest compared baseline by 13.3-17.4 points across all eight splits. Our code will be publicly released.
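
The skeleton-text alignment setup that the abstract takes as its starting point can be illustrated with a minimal sketch. This is not PoseBridge's method: it shows only the generic final step of ZSSAR, where a skeleton embedding is compared against text embeddings of unseen class names and the closest class is predicted. The function names, the toy three-dimensional embeddings, and the use of cosine similarity are all assumptions made for illustration. In practice both embeddings would come from trained skeleton and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(skeleton_embedding, class_text_embeddings):
    """Pick the unseen class whose text embedding is closest to the skeleton embedding.

    Hypothetical helper: both inputs are assumed to already live in a shared
    embedding space produced by some skeleton encoder and text encoder.
    """
    scores = {
        name: cosine(skeleton_embedding, text_embedding)
        for name, text_embedding in class_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (made up for illustration, not taken from any dataset).
unseen_class_embeddings = {
    "drink water": [0.9, 0.1, 0.0],
    "kick ball": [0.0, 0.2, 0.9],
}
skeleton_embedding = [0.8, 0.3, 0.1]

predicted, scores = zero_shot_classify(skeleton_embedding, unseen_class_embeddings)
print(predicted)
```

The abstract's argument is that when the skeleton embedding is the only input to this step, cues lost during pose estimation (such as the object being held) cannot be recovered here, which is why PoseBridge draws additional cues from intermediate HPE representations.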

Sanghyeon Lee, Jinwoo Kim, Jong Taek Lee • 2026

Related benchmarks

Task	Dataset	Metric	Value	(unlabeled)
Skeleton-based Action Recognition	NTU-RGB+D 120 (X-Sub)	Accuracy	43.7	79
Action Recognition	NTU-RGB+D 120 (96/24)	Accuracy	65	35
Skeleton-based Action Recognition	NTU RGB+D 60 (X-sub)	Top-1 Accuracy	0.414	34
Zero-Shot Skeleton-based Action Recognition	NTU-RGB+D Xsub 60 (55/5)	Accuracy	90.5	29
Zero-Shot Skeleton-based Action Recognition	NTU-RGB+D Xsub 120 (110/10)	Accuracy	80.3	29
Action Recognition	NTU RGB+D 60 (55/5)	Accuracy	87.8	27
Zero-Shot Skeleton-based Action Recognition	NTU-RGB+D Xsub 60 (48/12)	Accuracy	73.2	20
Zero-Shot Skeleton-based Action Recognition	NTU-RGB+D Xsub 120 (96/24)	Accuracy	70.9	20
Action Recognition	Kinetics-skeleton 200 (180/20)	Top-1 Accuracy	55.6	11
Skeleton-based Action Recognition	Kinetics-400 (320/80)	Top-1 Accuracy	39.6	11

Showing 10 of 19 rows
