BHaRNet: Reliability-Aware Body-Hand Modality Expertized Networks for Fine-grained Skeleton Action Recognition
About
Skeleton-based human action recognition (HAR) has achieved remarkable progress with graph-based architectures. However, most existing methods remain body-centric, focusing on large-scale motions while neglecting subtle hand articulations that are crucial for fine-grained recognition. This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expertized learning under uncertainty across both intra-skeleton and cross-modal domains. The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion cues in a unified cross-modal formulation. Comprehensive evaluations across multiple benchmarks (NTU RGB+D 60/120, PKU-MMD, N-UCLA) and a newly defined hand-centric benchmark exhibit consistent improvements and robustness under noisy and heterogeneous conditions.
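The Noisy-OR fusion named in component (2) can be illustrated with a minimal sketch. It assumes each stream (for example, a body expert and a hand expert) outputs per-class probabilities and treats them as independent evidence, so that class c is fused as 1 - (1 - p_body[c]) * (1 - p_hand[c]). The function names `softmax` and `noisy_or_fuse` and the example logits are hypothetical and chosen only for illustration; the formulation actually used in the paper, including any reliability weighting, may differ.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def noisy_or_fuse(stream_probs):
    """Fuse per-class probabilities from several streams with Noisy-OR.

    For each class, the fused score is one minus the probability that
    every stream "misses" that class, assuming the streams are independent.
    """
    num_classes = len(stream_probs[0])
    fused = []
    for c in range(num_classes):
        miss = 1.0
        for probs in stream_probs:
            miss *= 1.0 - probs[c]
        fused.append(1.0 - miss)
    return fused


# Hypothetical logits: the body stream favors class 0, the hand stream class 2.
body_probs = softmax([2.0, 0.5, 0.1])
hand_probs = softmax([0.2, 0.3, 2.5])

fused = noisy_or_fuse([body_probs, hand_probs])
predicted = max(range(len(fused)), key=fused.__getitem__)
```

Under this rule a fused score is never lower than the strongest single-stream score for that class, so a confident stream is not averaged away by an uncertain one. The fused scores do not sum to one; they are used here only for ranking classes.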
Related benchmarks
| Task | Dataset | Metric | Result (%) | Rank |
|---|---|---|---|---|
| Action Recognition | NTU RGB+D 120 (X-set) | Accuracy | 96.5 | 661 |
| Action Recognition | NTU RGB+D 60 (X-sub) | Accuracy | 97 | 467 |
| Action Recognition | NTU RGB+D X-sub 120 | Accuracy | 95.5 | 377 |
| Action Recognition | NTU RGB+D X-View 60 | Accuracy | 99.4 | 172 |
| Skeleton-based Action Recognition | NTU-RGB+D 120 (Cross-setup) | Accuracy | 95.8 | 136 |
| Skeleton-based Action Recognition | NTU RGB+D 60 (Cross-Subject) | Accuracy | 96.8 | 59 |
| Action Recognition | N-UCLA Cross-View | Accuracy | 96.3 | 32 |
| Skeleton Action Recognition | NTU RGB+D Cross-Subject (Xsub) 120 | Accuracy | 94.8 | 29 |
| Action Recognition | PKU-MMD Cross-view | Accuracy | 98.7 | 26 |
| Action Recognition | PKU-MMD (XSub) | Top-1 Acc | 97.5 | 20 |