Real-time 3D human action recognition based on Hyperpoint sequence
About
Real-time 3D human action recognition has broad industrial applications, such as surveillance, human-computer interaction, and healthcare monitoring. Most existing point cloud sequence networks rely on complex spatio-temporal local encoding to capture spatio-temporal local structures for recognizing 3D human actions. To simplify the point cloud sequence modeling task, we propose a lightweight and effective point cloud sequence network, referred to as SequentialPointNet, for real-time 3D action recognition. Instead of capturing spatio-temporal local structures, SequentialPointNet encodes the temporal evolution of static appearances to recognize human actions. First, we define a novel type of point data, the Hyperpoint, to better describe temporally changing human appearances. A theoretical foundation is provided to clarify the information equivalence property for converting point cloud sequences into Hyperpoint sequences. Second, the point cloud sequence modeling task is decomposed into a Hyperpoint embedding task and a Hyperpoint sequence modeling task. Specifically, for Hyperpoint embedding, static point cloud technology is employed to convert point cloud sequences into Hyperpoint sequences, which introduces inherent frame-level parallelism; for Hyperpoint sequence modeling, a Hyperpoint-Mixer module is designed as the basic building block to learn the spatio-temporal features of human actions. Extensive experiments on three widely-used 3D action recognition datasets demonstrate that the proposed SequentialPointNet achieves competitive classification performance while running up to 10X faster than existing approaches.
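The frame-level parallelism mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a PointNet-style embedding (a shared per-point MLP followed by max pooling) as a stand-in for the "static point cloud technology", and the names `embed_frame`, `embed_sequence`, `W1`, `W2` and all sizes are hypothetical. The point is only that each frame is embedded independently of the others, so the whole sequence can be processed in one batched pass.

```python
import numpy as np

# Hypothetical sizes: T frames, N points per frame, D-dimensional embedding.
rng = np.random.default_rng(0)
T, N, HIDDEN, D = 4, 128, 16, 32

# Randomly initialised weights of a shared per-point MLP (assumption, untrained).
W1 = rng.standard_normal((3, HIDDEN))
W2 = rng.standard_normal((HIDDEN, D))


def embed_frame(points):
    """Embed one static point cloud of shape (N, 3) into a vector of shape (D,)."""
    h = np.maximum(points @ W1, 0.0)  # per-point features, ReLU
    h = np.maximum(h @ W2, 0.0)
    return h.max(axis=0)              # order-invariant max pool over points


def embed_sequence(seq):
    """Embed all frames of a (T, N, 3) sequence at once, giving (T, D)."""
    h = np.maximum(seq @ W1, 0.0)     # matmul broadcasts over the frame axis
    h = np.maximum(h @ W2, 0.0)
    return h.max(axis=1)              # pool over points within each frame


sequence = rng.standard_normal((T, N, 3))
batched = embed_sequence(sequence)
looped = np.stack([embed_frame(frame) for frame in sequence])
print(batched.shape, np.allclose(batched, looped))
```

Because no frame's embedding depends on any other frame, the batched and per-frame results agree, and the per-frame work can be distributed freely. Modeling the temporal order of the resulting sequence is left to a separate stage, which in the paper is the Hyperpoint-Mixer module and is not sketched here.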
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Action Recognition | NTU RGB+D 120 (X-set) | Accuracy 95.4 | 770 |
| Action Recognition | NTU RGB+D 60 (Cross-View) | Accuracy 97.6 | 601 |
| Action Recognition | NTU RGB-D Cross-Subject 60 | Accuracy 90.3 | 358 |
| Action Recognition | NTU RGB+D 120 Cross-Subject | Accuracy 83.5 | 241 |
| Action Recognition | MSRAction3D | Accuracy 92.64 | 176 |
| Action Recognition | NTU RGB+D | Accuracy 90.3 | 50 |
| Action Recognition | UTD-MHAD | Accuracy 92.31 | 8 |