DG-STGCN: Dynamic Spatial-Temporal Modeling for Skeleton-based Action Recognition
About
Graph convolution networks (GCNs) have been widely used in skeleton-based action recognition. We note that existing GCN-based approaches primarily rely on prescribed graphical structures (i.e., a manually defined topology of skeleton joints), which limits their flexibility to capture complicated correlations between joints. To move beyond this limitation, we propose a new framework for skeleton-based action recognition, namely Dynamic Group Spatio-Temporal GCN (DG-STGCN). It consists of two modules, DG-GCN and DG-TCN, for spatial and temporal modeling, respectively. In particular, DG-GCN uses learned affinity matrices to capture dynamic graphical structures instead of relying on a prescribed one, while DG-TCN performs group-wise temporal convolutions with varying receptive fields and incorporates a dynamic joint-skeleton fusion module for adaptive multi-level temporal modeling. On a wide range of benchmarks, including NTU RGB+D, Kinetics-Skeleton, BABEL, and Toyota SmartHome, DG-STGCN consistently outperforms state-of-the-art methods, often by a notable margin.
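To make the idea of learned affinity matrices concrete, the following is a minimal sketch of group-wise spatial aggregation without a prescribed skeleton graph. It is not the authors' implementation: the function names (`init_affinity`, `group_graph_aggregate`), the equal channel split, and the random initialization are all assumptions made for illustration, and the data-dependent (dynamic) part of DG-GCN is omitted.

```python
import random


def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def init_affinity(num_joints, rng):
    """Hypothetical initializer: a dense V x V matrix of small random weights.

    No skeleton topology is encoded here; during training these entries
    would be updated by gradient descent like any other parameter.
    """
    return [[rng.uniform(-0.1, 0.1) for _ in range(num_joints)]
            for _ in range(num_joints)]


def group_graph_aggregate(x, affinities):
    """Mix joint features with one affinity matrix per channel group.

    x:          V x C joint features (V joints, C channels).
    affinities: list of K matrices, each V x V (assumed layout).
    Channels are split into K equal groups; group g is aggregated across
    joints by affinities[g], and the groups are concatenated again.
    """
    num_channels = len(x[0])
    num_groups = len(affinities)
    assert num_channels % num_groups == 0, "channels must split evenly"
    width = num_channels // num_groups

    out = [[] for _ in x]
    for g, affinity in enumerate(affinities):
        chunk = [row[g * width:(g + 1) * width] for row in x]
        mixed = matmul(affinity, chunk)  # V x width
        for joint, row in enumerate(mixed):
            out[joint].extend(row)
    return out


rng = random.Random(0)
V, C, K = 5, 4, 2  # toy sizes: 5 joints, 4 channels, 2 groups
x = [[float(v + c) for c in range(C)] for v in range(V)]
affinities = [init_affinity(V, rng) for _ in range(K)]
y = group_graph_aggregate(x, affinities)  # same V x C shape as x
```

Because each group has its own dense matrix, any joint can attend to any other joint, and different channel groups can learn different joint correlations. With identity matrices as affinities the function returns its input unchanged, which is a convenient sanity check.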
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Action Recognition | NTU RGB+D 120 (X-set) | Accuracy 91.4 | 661 |
| Action Recognition | NTU RGB+D 60 (Cross-View) | Accuracy 98.6 | 575 |
| Action Recognition | Kinetics-400 | Top-1 Acc 40.3 | 413 |
| Action Recognition | NTU RGB-D Cross-Subject 60 | Accuracy 94.1 | 305 |
| Action Recognition | Kinetics 400 (test) | -- | 245 |
| Action Recognition | NTU RGB+D 120 Cross-Subject | Accuracy 89.6 | 183 |
| Action Recognition | NTU 120 (Cross-Setup) | Accuracy 91.3 | 112 |
| Action Recognition | Toyota Smarthome CS | Accuracy 65.1 | 58 |
| Action Recognition | Toyota SmartHome (TSH) (CV1) | Accuracy 41.8 | 54 |
| Action Recognition | NTU120 (cross-subject (CS)) | Top-1 Accuracy 89.6 | 36 |