
Body-Hand Modality Expertized Networks with Cross-attention for Fine-grained Skeleton Action Recognition

About

Skeleton-based Human Action Recognition (HAR) is a vital technology in robotics and human-robot interaction. However, most existing methods concentrate primarily on full-body movements and often overlook the subtle hand motions that are critical for distinguishing fine-grained actions. Recent work leverages a unified graph representation that combines body, hand, and foot keypoints to capture detailed body dynamics. Yet these models often blur fine hand details, owing to the disparity between body and hand action characteristics and the loss of subtle features during spatial pooling. In this paper, we propose BHaRNet (Body-Hand action Recognition Network), a novel framework that augments a typical body-expert model with a hand-expert model. Our model jointly trains both streams with an ensemble loss that fosters cooperative specialization, functioning in a manner reminiscent of a Mixture-of-Experts (MoE). Moreover, cross-attention is employed via an expertized branch method and a pooling-attention module to enable feature-level interactions and selectively fuse complementary information. Inspired by MMNet, we also demonstrate the applicability of our approach to multi-modal tasks by leveraging RGB information, where body features guide RGB learning to capture richer contextual cues. Experiments on large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Northwestern-UCLA) demonstrate that BHaRNet achieves state-of-the-art accuracy, improving from 86.4% to 93.0% on hand-intensive actions, while requiring fewer GFLOPs and parameters than the relevant unified methods.
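The two-expert design described above (a body stream and a hand stream that exchange information via cross-attention, then combine class scores ensemble-style) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the single-head scaled dot-product attention, the residual fusion, the mean pooling over time, and all array shapes and weight names here are assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    # q_feats:  (T, d) per-frame queries from one stream (e.g. body expert)
    # kv_feats: (T, d) keys/values from the other stream (e.g. hand expert)
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)  # (T, T) frame-to-frame affinities
    attn = softmax(scores, axis=-1)
    return attn @ kv_feats                      # attended context for each query frame

rng = np.random.default_rng(0)
T, d, n_cls = 8, 16, 5                          # frames, feature dim, classes (toy sizes)
body = rng.normal(size=(T, d))                  # stand-in for body-expert features
hand = rng.normal(size=(T, d))                  # stand-in for hand-expert features

# Each expert attends to the other's features (residual fusion), then pools over time.
body_ctx = body + cross_attention(body, hand)
hand_ctx = hand + cross_attention(hand, body)
W_b = rng.normal(size=(d, n_cls))               # hypothetical per-expert classifier weights
W_h = rng.normal(size=(d, n_cls))
logits_b = body_ctx.mean(axis=0) @ W_b
logits_h = hand_ctx.mean(axis=0) @ W_h

# MoE-style late fusion: average the two experts' class scores.
probs = softmax((logits_b + logits_h) / 2.0)
print(probs.shape)                              # one probability per action class
```

The key design choice the sketch mirrors is that neither stream is discarded before classification: each expert keeps its own classifier, and cooperation happens both at the feature level (cross-attention) and at the score level (the averaged logits that an ensemble loss would supervise).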

Seungyeon Cho, Tae-Kyun Kim • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| Action Recognition | NTU RGB+D 120 (X-set) | Accuracy | 96 | 661 |
| Action Recognition | NTU RGB+D 60 (X-sub) | Accuracy | 96.3 | 467 |
| Action Recognition | NTU RGB+D X-sub 120 | Accuracy | 95.1 | 377 |
| Action Recognition | NTU RGB+D X-View 60 | Accuracy | 99 | 172 |
| Skeleton-based Action Recognition | NTU-RGB+D 120 (Cross-setup) | Accuracy | 95 | 136 |
| Skeleton-based Action Recognition | NTU RGB+D 60 (Cross-Subject) | Accuracy | 96.2 | 59 |
| Action Recognition | N-UCLA Cross-View | Accuracy | 94.6 | 32 |
| Skeleton Action Recognition | NTU RGB+D Cross-Subject (Xsub) 120 | Accuracy | 94.3 | 29 |
| Action Recognition | PKU-MMD Cross-view | Accuracy | 97.9 | 26 |
| Action Recognition | PKU-MMD (XSub) | Top-1 Acc | 96.9 | 20 |

(Showing 10 of 11 rows)
