Skeleton-based Action Recognition with Convolutional Neural Networks
About
Current state-of-the-art approaches to skeleton-based action recognition are mostly based on recurrent neural networks (RNNs). In this paper, we propose a novel framework based on convolutional neural networks (CNNs) for both action classification and detection. Raw skeleton coordinates as well as skeleton motion are fed directly into the CNN for label prediction. A novel skeleton transformer module is designed to rearrange and select important skeleton joints automatically. With a simple 7-layer network, we obtain 89.3% accuracy on the validation set of the NTU RGB+D dataset. For action detection in untrimmed videos, we develop a window proposal network to extract temporal segment proposals, which are further classified within the same network. On the recent PKU-MMD dataset, we achieve 93.7% mAP, surpassing the baseline by a large margin.
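The two inputs named in the abstract, raw coordinates and skeleton motion, and the idea of rearranging joints can be illustrated with a minimal NumPy sketch. It rests on assumptions the abstract does not confirm: motion is taken to be the frame-to-frame difference of joint coordinates, and the joint rearrangement is modeled as a single linear map over the joint axis. The function names `skeleton_motion` and `transform_joints`, the array layout `(frames, joints, coords)`, and the sizes used are hypothetical choices for illustration, not the paper's implementation.

```python
import numpy as np


def skeleton_motion(seq):
    """Frame-to-frame differences of joint coordinates (an assumed definition).

    seq: array of shape (T, J, C) = (frames, joints, coordinates).
    Returns an array of the same shape; the first frame has zero motion.
    """
    motion = np.zeros_like(seq)
    motion[1:] = seq[1:] - seq[:-1]
    return motion


def transform_joints(seq, weights):
    """Linear map over the joint axis (assumed stand-in for joint rearrangement).

    seq:     array of shape (T, J, C).
    weights: array of shape (J, M); column m holds the mixing weights that
             build output joint m from the J input joints. In a trained
             network these weights would be learned.
    Returns an array of shape (T, M, C).
    """
    return np.einsum("tjc,jm->tmc", seq, weights)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, J, C, M = 32, 25, 3, 30           # illustrative sizes only
    seq = rng.standard_normal((T, J, C))  # stand-in for real skeleton data
    weights = rng.standard_normal((J, M))

    coords = transform_joints(seq, weights)
    motion = transform_joints(skeleton_motion(seq), weights)
    print(coords.shape, motion.shape)
```

With a permutation matrix as `weights`, the map only reorders joints; with general weights, each output joint is a weighted mix of the inputs, which is one way a network could both rearrange and select joints.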
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Action Recognition | NTU RGB+D (Cross-View) | Accuracy | 89.3 | 609 |
| Action Recognition | NTU RGB+D 60 (Cross-View) | Accuracy | 89.3 | 575 |
| Action Recognition | NTU RGB+D (Cross-subject) | Accuracy | 83.2 | 474 |
| Action Recognition | NTU RGB+D 60 (X-sub) | Accuracy | 68.7 | 467 |
| Action Recognition | NTU RGB-D Cross-Subject 60 | Accuracy | 83.2 | 305 |
| Skeleton-based Action Recognition | NTU (Cross-Subject) | Accuracy | 83.2 | 86 |
| Action Recognition | PKU-MMD Cross-view | Accuracy | 93.7 | 26 |
| Action Recognition | PKU-MMD (XSub) | Top-1 Acc | 90.4 | 20 |
| Gesture Recognition | ChaLearn Gesture Recognition dataset | F1-score | 0.912 | 16 |
| Gesture Recognition | ChaLearn 2013 (test) | Accuracy | 91.2 | 14 |