
PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition

About

Zero-shot skeleton-based action recognition (ZSSAR) is typically treated as a skeleton-text alignment problem: encode joint-coordinate sequences, align them with language, and classify unseen actions. We argue that this alignment is often too late. Skeletons are not complete action observations, but compressed outputs of human pose estimation (HPE); by the time alignment begins, human-object interactions and pose-relative visual cues may no longer be explicit. We call this upstream semantic loss. To address it, we propose PoseBridge, an HPE-aware ZSSAR framework that bridges intermediate HPE representations to skeleton-text alignment. Rather than adding an RGB action branch or object detector, PoseBridge extracts pose-anchored semantic cues from the same HPE process that produces skeletons, then transfers them through skeleton-conditioned bridging and semantic prototype adaptation. Across NTU-RGB+D 60/120, PKU-MMD, and Kinetics-200/400, PoseBridge improves ZSSAR performance under the evaluated protocols. On the Kinetics-200/400 PURLS benchmark, which contains in-the-wild videos with diverse scenes and action contexts, PoseBridge shows the clearest separation, improving the strongest compared baseline by 13.3-17.4 points across all eight splits. Our code will be publicly released.
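The generic skeleton-text alignment step that the abstract takes as its starting point can be illustrated with a small sketch. This is not PoseBridge itself and does not reproduce its HPE-aware bridging or prototype adaptation; it only shows how a zero-shot classifier might pick an unseen action by comparing one skeleton embedding with text embeddings of class names. The function names, the class labels, and the 3-dimensional toy embeddings are all assumptions made for illustration; in practice the embeddings would come from a trained skeleton encoder and a language model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def zero_shot_classify(skeleton_embedding, class_text_embeddings):
    """Return the unseen class whose text embedding is closest to the skeleton embedding.

    Hypothetical helper: both inputs are assumed to live in a shared
    embedding space produced by some upstream alignment training.
    """
    scores = {
        label: cosine_similarity(skeleton_embedding, text_embedding)
        for label, text_embedding in class_text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy data (assumed, not from the paper): two unseen classes and one skeleton sample.
unseen_class_text_embeddings = {
    "drink water": [0.9, 0.1, 0.0],
    "kick ball": [0.0, 0.2, 0.9],
}
skeleton_embedding = [0.8, 0.3, 0.1]

predicted_label, similarity_scores = zero_shot_classify(
    skeleton_embedding, unseen_class_text_embeddings
)
print(predicted_label, similarity_scores)
```

In this setup, any cue that the skeleton encoder never received (for example, the object being held) cannot influence the similarity scores, which is the kind of upstream semantic loss the abstract describes.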

Sanghyeon Lee, Jinwoo Kim, Jong Taek Lee • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Skeleton-based Action Recognition | NTU-RGB+D 120 (X-Sub) | Accuracy | 43.7 | 79 |
| Action Recognition | NTU-RGB+D 120 (96/24) | Accuracy | 65 | 35 |
| Skeleton-based Action Recognition | NTU RGB+D 60 (X-sub) | Top-1 Accuracy | 0.414 | 34 |
| Zero-Shot Skeleton-based Action Recognition | NTU-RGB+D Xsub 60 (55/5) | Accuracy | 90.5 | 29 |
| Zero-Shot Skeleton-based Action Recognition | NTU-RGB+D Xsub 120 (110/10) | Accuracy | 80.3 | 29 |
| Action Recognition | NTU RGB+D 60 (55/5) | Accuracy | 87.8 | 27 |
| Zero-Shot Skeleton-based Action Recognition | NTU-RGB+D Xsub 60 (48/12) | Accuracy | 73.2 | 20 |
| Zero-Shot Skeleton-based Action Recognition | NTU-RGB+D Xsub 120 (96/24) | Accuracy | 70.9 | 20 |
| Action Recognition | Kinetics-skeleton 200 (180/20) | Top-1 Accuracy | 55.6 | 11 |
| Skeleton-based Action Recognition | Kinetics-400 (320/80) | Top-1 Accuracy | 39.6 | 11 |
Showing 10 of 19 rows
