Multi-Scale Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition

About

Graph convolutional networks have been widely used for skeleton-based action recognition because they model non-Euclidean data well. Since graph convolution is a local operation, it can exploit only short-range joint dependencies and short-term trajectories; it fails to directly model the relations between distant joints and the long-range temporal information that are vital for distinguishing actions. To solve this problem, we present a multi-scale spatial graph convolution (MS-GC) module and a multi-scale temporal graph convolution (MT-GC) module that enrich the receptive field of the model in the spatial and temporal dimensions. Concretely, the MS-GC and MT-GC modules decompose the corresponding local graph convolution into a set of sub-graph convolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features are processed by a series of sub-graph convolutions, so each node can complete multiple spatial and temporal aggregations with its neighbors. The final equivalent receptive field is enlarged accordingly, allowing the model to capture both short- and long-range dependencies in the spatial and temporal domains. By coupling these two modules into a basic block, we further propose a multi-scale spatial temporal graph convolutional network (MST-GCN), which stacks multiple blocks to learn effective motion representations for action recognition. The proposed MST-GCN achieves remarkable performance on three challenging benchmark datasets for skeleton-based action recognition: NTU RGB+D, NTU-120 RGB+D, and Kinetics-Skeleton.
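The hierarchical residual idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the 5-joint chain graph, the mean aggregation, and the choice to pass the first channel group through unchanged are all assumptions made for illustration. Channels are split into groups, each group is aggregated over 1-hop neighbors after adding the previous group's output, and the reach of an impulse placed at joint 0 grows by one hop per group.

```python
# Illustrative sketch only; names, graph, and aggregation rule are assumptions,
# not taken from the MST-GCN code base.

def neighbor_mean(adj, feats):
    """One local aggregation: average each node with its 1-hop neighbors (self included)."""
    out = []
    for i, nbrs in enumerate(adj):
        idx = [i] + nbrs
        out.append(sum(feats[j] for j in idx) / len(idx))
    return out


def multi_scale_aggregate(adj, groups):
    """Hierarchical residual aggregation over channel groups.

    groups: one per-node feature list per channel group.
    Group 0 passes through unchanged (assumed). Every later group is
    aggregated after adding the previous group's output, so group k has
    effectively seen k hops of neighbors.
    """
    outputs = [list(groups[0])]
    prev = None
    for g in groups[1:]:
        inp = g if prev is None else [a + b for a, b in zip(g, prev)]
        prev = neighbor_mean(adj, inp)
        outputs.append(prev)
    return outputs


# Hypothetical 5-joint chain: 0 - 1 - 2 - 3 - 4
adj = [[1], [0, 2], [1, 3], [2, 4], [3]]

# Same impulse at joint 0 in each of 4 channel groups
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
groups = [list(impulse) for _ in range(4)]

outs = multi_scale_aggregate(adj, groups)

# Farthest joint with a nonzero response, per channel group
reach = [max(i for i, v in enumerate(o) if v != 0.0) for o in outs]
print(reach)  # prints [0, 1, 2, 3]
```

Concatenating the group outputs would give a feature that mixes 0-hop to 3-hop context. In this toy version the aggregation has no weights at all, so it only shows how the receptive field grows, not how the parameter count compares with a single local graph convolution.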

Zhan Chen, Sicheng Li, Bing Yang, Qinghan Li, Hong Liu • 2022

Related benchmarks

Task	Dataset	Result
Action Recognition	NTU RGB+D 120 (X-set)	Accuracy 88.8	770
Action Recognition	NTU RGB+D (Cross-View)	Accuracy 96.6	652
Action Recognition	NTU RGB+D 60 (Cross-View)	Accuracy 96.6	601
Action Recognition	NTU RGB+D (Cross-subject)	Accuracy 91.5	500
Action Recognition	NTU RGB+D 60 (X-sub)	Accuracy 91.5	496
Action Recognition	NTU RGB+D X-sub 120	Accuracy 87.5	473
Action Recognition	NTU RGB-D Cross-Subject 60	Accuracy 91.5	358
Action Recognition	NTU-60 (xsub)	Accuracy 91.5	251
Action Recognition	NTU RGB+D 120 Cross-Subject	Accuracy 87.5	241
Action Recognition	NTU-120 (cross-subject (xsub))	Accuracy 87.5	239

Showing 10 of 31 rows
