Action Recognition Based on Joint Trajectory Maps Using Convolutional Neural Networks
About
Recently, Convolutional Neural Networks (ConvNets) have shown promising performance in many computer vision tasks, especially image-based recognition. How to effectively use ConvNets for video-based recognition remains an open problem. In this paper, we propose a compact, effective yet simple method to encode the spatio-temporal information carried in 3D skeleton sequences into multiple 2D images, referred to as Joint Trajectory Maps (JTMs); ConvNets are then adopted to exploit discriminative features for real-time human action recognition. The proposed method has been evaluated on three public benchmarks, i.e., the MSRC-12 Kinect Gesture dataset (MSRC-12), the G3D dataset, and the UTD Multimodal Human Action dataset (UTD-MHAD), and achieves state-of-the-art results.
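The core idea above — rasterizing each joint's trajectory into a 2D image so a standard ConvNet can consume it — can be sketched as follows. This is a simplified, single-plane grayscale variant for illustration only, not the paper's exact encoding (the paper produces multiple color-coded maps); the array shapes, image size, and intensity scheme here are assumptions.

```python
import numpy as np

def joint_trajectory_map(skeleton, size=32):
    """Rasterize a 3D skeleton sequence into a 2D map.

    skeleton: array of shape (T, J, 3) -- T frames, J joints, xyz coords.
    Projects joints onto the x-y (front-view) plane and accumulates each
    joint's trajectory; pixel intensity encodes temporal order, so the
    direction of motion is preserved in the image.
    """
    T, J, _ = skeleton.shape
    xy = skeleton[:, :, :2]  # front-view projection (drop depth)
    # Normalize coordinates into the [0, size-1] pixel range.
    mn = xy.min(axis=(0, 1))
    mx = xy.max(axis=(0, 1))
    px = ((xy - mn) / (mx - mn + 1e-8) * (size - 1)).astype(int)
    jtm = np.zeros((size, size))
    for t in range(T):
        for j in range(J):
            x, y = px[t, j]
            # Later frames are drawn brighter; max() keeps the most
            # recent visit when trajectories overlap.
            jtm[y, x] = max(jtm[y, x], (t + 1) / T)
    return jtm
```

In practice, maps from several projection planes would be stacked or fed to separate ConvNet streams, and the resulting scores fused for the final prediction.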
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Action Recognition | NTU RGB+D (Cross-View) | Accuracy | 81.08 | 609 |
| Action Recognition | NTU RGB+D 60 (Cross-View) | Accuracy | 35.9 | 575 |
| Action Recognition | NTU RGB+D (Cross-Subject) | Accuracy | 76.32 | 474 |
| Action Recognition | NTU RGB+D 60 (Cross-Subject) | Accuracy | 39.1 | 467 |
| Skeleton-based Action Recognition | NTU (Cross-Subject) | Accuracy | 73.4 | 86 |
| Skeleton-based Action Recognition | NTU RGB+D Cross-View (CV) 1.0 | Accuracy | 75.2 | 38 |
| Action Recognition | UTD-MHAD (Cross-Subject) | Accuracy | 87.9 | 36 |
| Action Recognition | NTU RGB+D V2 (Cross-Subject) | Accuracy | 73.4 | 16 |
| Action Recognition | NTU RGB+D V2 (Cross-View) | Accuracy | 75.2 | 16 |
| Action Recognition | G3D (test) | Accuracy | 96.02 | 11 |