Vertex Feature Encoding and Hierarchical Temporal Modeling in a Spatial-Temporal Graph Convolutional Network for Action Recognition
About
This paper extends the Spatial-Temporal Graph Convolutional Network (ST-GCN) for skeleton-based action recognition by introducing two novel modules: the Graph Vertex Feature Encoder (GVFE) and the Dilated Hierarchical Temporal Convolutional Network (DH-TCN). The GVFE module learns vertex features suited to action recognition by encoding raw skeleton data into a new feature space. The DH-TCN module captures both short-term and long-term temporal dependencies using a hierarchical dilated convolutional network. Experiments have been conducted on the challenging NTU RGB+D 60 and NTU RGB+D 120 datasets. The results show that our method is competitive with state-of-the-art approaches while using fewer layers and parameters, thus reducing the required training time and memory.
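To illustrate why a hierarchy of dilated convolutions can cover both short-term and long-term dependencies with few layers, the sketch below stacks 1D dilated convolutions along a single time axis and computes the resulting receptive field. It is not the authors' DH-TCN implementation: the kernel size of 3, the dilation rates 1, 2, 4, and the function names `dilated_conv1d`, `hierarchical_stack`, and `receptive_field` are all assumptions chosen for illustration, since the abstract does not give these details.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-length 1D dilated convolution (cross-correlation) with zero padding."""
    center = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for i, w in enumerate(kernel):
            # Taps are spaced `dilation` frames apart around frame t.
            idx = t + (i - center) * dilation
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def hierarchical_stack(signal, kernel, dilations):
    """Apply one dilated convolution per level, feeding each output to the next."""
    out = list(signal)
    for d in dilations:
        out = dilated_conv1d(out, kernel, d)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input frames that can influence one output frame."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Hypothetical configuration (assumed, not taken from the paper).
KERNEL = [1.0, 1.0, 1.0]
DILATIONS = [1, 2, 4]

# A unit impulse at frame 10 of a 21-frame sequence: the frames that become
# non-zero after the stack show how far temporal information propagates.
impulse = [0.0] * 21
impulse[10] = 1.0
response = hierarchical_stack(impulse, KERNEL, DILATIONS)
reached = [t for t, v in enumerate(response) if v != 0.0]
```

With these assumed settings, three layers see 15 frames, whereas three undilated layers with the same kernel size would see only 7. Early levels (small dilation) respond to local motion, while later levels (large dilation) aggregate over longer spans.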
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Action Recognition | NTU RGB+D 120 (X-set) | Accuracy | 79.8 | 661 |
| Action Recognition | NTU RGB+D 60 (Cross-View) | Accuracy | 92.8 | 575 |
| Action Recognition | NTU RGB+D 60 (X-sub) | Accuracy | 85.3 | 467 |
| Action Recognition | NTU RGB+D X-sub 120 | Accuracy | 78.3 | 377 |
| Skeleton-based Action Recognition | NTU RGB+D (Cross-View) | Accuracy | 92.8 | 213 |
| Action Recognition | NTU RGB+D 120 Cross-Subject | Accuracy | 78.3 | 183 |
| Skeleton-based Action Recognition | NTU RGB+D 120 Cross-Subject | Top-1 Accuracy | 78.3 | 143 |
| Skeleton-based Action Recognition | NTU 120 (X-sub) | Accuracy | 78.3 | 139 |
| Skeleton-based Action Recognition | NTU-RGB+D 120 (Cross-setup) | Accuracy | 79.8 | 136 |
| Skeleton-based Action Recognition | NTU RGB+D (Cross-subject) | Accuracy | 85.3 | 123 |