NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis
About
Recent approaches in depth-based human activity analysis have achieved outstanding performance and proved the effectiveness of 3D representations for action classification. Currently available depth-based and RGB+D-based action recognition benchmarks have a number of limitations, including a lack of training samples, distinct class labels, camera views, and subject variety. In this paper, we introduce a large-scale dataset for RGB+D human action recognition with more than 56 thousand video samples and 4 million frames, collected from 40 distinct subjects. Our dataset contains 60 different action classes, including daily, mutual, and health-related actions. In addition, we propose a new recurrent neural network structure that models the long-term temporal correlation of the features for each body part and uses these correlations for better action classification. Experimental results show the advantages of applying deep learning methods over state-of-the-art hand-crafted features on the suggested cross-subject and cross-view evaluation criteria for our dataset. The introduction of this large-scale dataset will enable the community to apply, develop, and adapt various data-hungry learning techniques for the task of depth-based and RGB+D-based human activity analysis.
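The cross-subject and cross-view criteria mentioned above both amount to partitioning samples by one identifier: the performer for cross-subject, the camera for cross-view. The following is a minimal sketch of such a split, not the authors' code. It assumes sample names follow a `SsssCcccPpppRrrrAaaa` pattern (setup, camera, performer, replication, action), which the abstract does not state, and the helper names `parse_sample_name` and `split_samples` are hypothetical. The training ID sets are left as parameters because the abstract does not list them.

```python
import re
from typing import Iterable, List, NamedTuple, Optional, Set, Tuple

# Assumed name layout (not given in the abstract):
# S<setup>C<camera>P<subject>R<replication>A<action>, three digits each.
_NAME = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")


class SampleId(NamedTuple):
    setup: int
    camera: int
    subject: int
    replication: int
    action: int


def parse_sample_name(name: str) -> Optional[SampleId]:
    """Extract the numeric IDs from a sample name, or None if it does not match."""
    m = _NAME.search(name)
    if m is None:
        return None
    return SampleId(*(int(g) for g in m.groups()))


def split_samples(
    names: Iterable[str], key: str, train_ids: Set[int]
) -> Tuple[List[str], List[str]]:
    """Partition names into (train, test) by one SampleId field.

    key="subject" gives a cross-subject split; key="camera" gives a
    cross-view split. Names that do not match the pattern are skipped.
    """
    train: List[str] = []
    test: List[str] = []
    for name in names:
        sid = parse_sample_name(name)
        if sid is None:
            continue
        (train if getattr(sid, key) in train_ids else test).append(name)
    return train, test


if __name__ == "__main__":
    # Made-up sample names and ID sets, for illustration only.
    names = [
        "S001C001P001R001A001.skeleton",
        "S001C002P003R002A010.skeleton",
        "S002C003P001R001A060.skeleton",
    ]
    print(split_samples(names, "subject", {1}))
    print(split_samples(names, "camera", {2, 3}))
```

Keeping the split key and the training IDs as arguments lets the same function serve both protocols; the actual ID sets should be taken from the paper or the dataset documentation rather than from this sketch.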
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Action Recognition | NTU RGB+D 120 (X-set) | Accuracy | 26.3 | 661 |
| Action Recognition | NTU RGB+D (Cross-View) | Accuracy | 70.3 | 609 |
| Action Recognition | NTU RGB+D 60 (Cross-View) | Accuracy | 70.3 | 575 |
| Action Recognition | NTU RGB+D (Cross-subject) | Accuracy | 74.9 | 474 |
| Action Recognition | NTU RGB+D 60 (X-sub) | Accuracy | 62.93 | 467 |
| Action Recognition | Kinetics-400 | Top-1 Acc | 16.4 | 413 |
| Action Recognition | NTU RGB+D X-sub 120 | Accuracy | 26.3 | 377 |
| Action Recognition | NTU RGB-D Cross-Subject 60 | Accuracy | 62.9 | 305 |
| Skeleton-based Action Recognition | NTU RGB+D (Cross-View) | Accuracy | 70.3 | 213 |
| Action Recognition | NTU RGB+D 120 Cross-Subject | Accuracy | 25.5 | 183 |