Learning Discriminative Representations for Skeleton Based Action Recognition
About
Human action recognition aims to classify the category of a human action from a video segment. Recent work has focused on designing GCN-based models that extract features from skeletons for this task, because skeleton representations are much more efficient and robust than other modalities such as RGB frames. However, skeleton data discards some important cues, such as the objects involved in an action. As a result, some actions become ambiguous: they are hard to distinguish and tend to be misclassified. To alleviate this problem, we propose an auxiliary feature refinement head (FR Head), which consists of spatial-temporal decoupling and contrastive feature refinement, to obtain discriminative representations of skeletons. Ambiguous samples are dynamically discovered and calibrated in the feature space. Furthermore, the FR Head can be attached to different stages of a GCN to build multi-level refinement for stronger supervision. Extensive experiments are conducted on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets. The proposed models obtain results competitive with state-of-the-art methods and help discriminate the ambiguous samples. Code is available at https://github.com/zhysora/FR-Head.
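The abstract describes the FR Head only at a high level. As a rough illustration of what "discovering ambiguous samples and calibrating them in the feature space" could look like, the following NumPy sketch treats misclassified samples as ambiguous, builds per-class prototypes from correctly classified samples, and applies an InfoNCE-style loss that pulls each ambiguous sample toward the prototype of its true class. This is not the authors' implementation (see the linked repository for that): the function names, the prototype definition, the temperature value, and the toy data are all assumptions made for illustration, and spatial-temporal decoupling is omitted entirely.

```python
import numpy as np

def find_ambiguous(logits, labels):
    """Assumption: a sample counts as 'ambiguous' if it is misclassified."""
    preds = logits.argmax(axis=1)
    return preds != labels  # boolean mask over the batch

def class_prototypes(features, labels, num_classes, confident):
    """Mean feature of the confidently (correctly) classified samples per class.

    Classes with no confident sample keep an all-zero prototype.
    """
    protos = np.zeros((num_classes, features.shape[1]))
    for c in range(num_classes):
        sel = (labels == c) & confident
        if sel.any():
            protos[c] = features[sel].mean(axis=0)
    return protos

def refinement_loss(features, labels, protos, ambiguous, temperature=0.1):
    """InfoNCE-style loss over ambiguous samples only (hypothetical formulation)."""
    if not ambiguous.any():
        return 0.0
    # L2-normalise so the dot product is a cosine similarity.
    f = features[ambiguous]
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
    p = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + 1e-8)
    sim = f @ p.T / temperature
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    targets = labels[ambiguous]
    return float(-log_prob[np.arange(len(f)), targets].mean())

# Toy batch: 4 samples, 2 classes, 2-D features (made-up values).
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.6, 0.4]])
labels = np.array([0, 0, 1, 1])
logits = np.array([[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [1.0, 0.5]])

ambiguous = find_ambiguous(logits, labels)  # only the last sample is misclassified
protos = class_prototypes(features, labels, 2, ~ambiguous)
loss = refinement_loss(features, labels, protos, ambiguous)
```

In a training setup, a term like `loss` would typically be added to the main classification loss with a small weight, so that it acts as auxiliary supervision rather than replacing the classifier objective.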
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Action Recognition | NTU RGB+D 120 (X-set) | Accuracy | 90.9 | 661 |
| Action Recognition | NTU RGB+D (Cross-View) | Accuracy | 95.3 | 609 |
| Action Recognition | NTU RGB+D 60 (Cross-View) | Accuracy | 96.8 | 575 |
| Action Recognition | NTU RGB+D 60 (X-sub) | Accuracy | 92.8 | 467 |
| Action Recognition | NTU RGB+D X-sub 120 | Accuracy | 89.5 | 377 |
| Action Recognition | NTU RGB-D Cross-Subject 60 | Accuracy | 92.8 | 305 |
| Skeleton-based Action Recognition | NTU RGB+D (Cross-View) | Accuracy | 96.8 | 213 |
| Skeleton-based Action Recognition | NTU RGB+D 120 (X-set) | Top-1 Accuracy | 90.9 | 184 |
| Action Recognition | NTU RGB+D 120 Cross-Subject | Accuracy | 89.5 | 183 |
| Action Recognition | NTU RGB+D X-View 60 | Accuracy | 96.8 | 172 |