Vertex Feature Encoding and Hierarchical Temporal Modeling in a Spatial-Temporal Graph Convolutional Network for Action Recognition

About

This paper extends the Spatial-Temporal Graph Convolutional Network (ST-GCN) for skeleton-based action recognition by introducing two novel modules: the Graph Vertex Feature Encoder (GVFE) and the Dilated Hierarchical Temporal Convolutional Network (DH-TCN). The GVFE module learns vertex features suited to action recognition by encoding raw skeleton data into a new feature space. The DH-TCN module captures both short-term and long-term temporal dependencies using a hierarchical dilated convolutional network. Experiments have been conducted on the challenging NTU RGB+D 60 and NTU RGB+D 120 datasets. The results show that our method competes with state-of-the-art approaches while using fewer layers and parameters, thus reducing the required training time and memory.

Konstantinos Papadopoulos, Enjie Ghorbel, Djamila Aouada, Björn Ottersten • 2019
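
The abstract's claim that a hierarchical dilated network can cover both short-term and long-term dependencies with few layers can be illustrated with a minimal pure-Python sketch. It is not the authors' DH-TCN implementation: the kernel size of 3, the dilation rates 1, 2, 4, and the function names are all assumptions chosen for illustration. Stacking dilated 1D convolutions grows the temporal receptive field as 1 + sum((k - 1) * d) over the layers.

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1D dilated convolution (cross-correlation) over a list of numbers."""
    span = (len(kernel) - 1) * dilation  # frames covered beyond the first tap
    return [
        sum(w * signal[t + i * dilation] for i, w in enumerate(kernel))
        for t in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Number of input frames that influence one output of the stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def hierarchical_dilated_stack(signal, kernel, dilations):
    """Apply dilated convolutions in sequence, one layer per dilation rate."""
    out = signal
    for d in dilations:
        out = dilated_conv1d(out, kernel, d)
    return out


# Hypothetical settings, not taken from the paper.
frames = list(range(20))   # stand-in for one joint coordinate over 20 frames
dilations = [1, 2, 4]      # assumed dilation schedule
kernel = [1, 1, 1]         # assumed kernel of size 3

rf = receptive_field(len(kernel), dilations)
features = hierarchical_dilated_stack(frames, kernel, dilations)
print(rf, len(features))
```

Under these assumed settings, three layers see 15 frames, whereas three undilated layers with the same kernel size would see only 7. This is the sense in which dilation extends temporal context without adding layers or parameters.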

Related benchmarks

Task	Dataset	Result
Action Recognition	NTU RGB+D 120 (X-set)	Accuracy 79.8	779
Action Recognition	NTU RGB+D 60 (Cross-View)	Accuracy 92.8	601
Action Recognition	NTU RGB+D 60 (X-sub)	Accuracy 85.3	496
Action Recognition	NTU RGB+D X-sub 120	Accuracy 78.3	482
Action Recognition	NTU RGB+D 120 Cross-Subject	Accuracy 78.3	249
Action Recognition	NTU 120 (Cross-Setup)	Accuracy 79.8	231
Skeleton-based Action Recognition	NTU RGB+D (Cross-View)	Accuracy 92.8	213
Skeleton-based Action Recognition	NTU 120 (X-sub)	Accuracy 78.3	153
Skeleton-based Action Recognition	NTU RGB+D 120 Cross-Subject	Top-1 Accuracy 78.3	143
Skeleton-based Action Recognition	NTU-RGB+D 120 (Cross-setup)	Accuracy 79.8	136

Showing 10 of 14 rows
