3Mformer: Multi-order Multi-mode Transformer for Skeletal Action Recognition
About
Many skeletal action recognition models use GCNs to represent the human body as 3D body joints connected by body parts. GCNs aggregate one- or few-hop graph neighbourhoods and ignore dependencies between body joints that are not linked. We propose to form a hypergraph to model hyper-edges between graph nodes (e.g., third- and fourth-order hyper-edges capture three and four nodes), which helps capture higher-order motion patterns of groups of body joints. We split action sequences into temporal blocks, and a Higher-order Transformer (HoT) produces embeddings of each temporal block based on (i) the body joints, (ii) pairwise links of body joints and (iii) higher-order hyper-edges of skeleton body joints. We combine such HoT embeddings of hyper-edges of orders 1, ..., r using a novel Multi-order Multi-mode Transformer (3Mformer) with two modules whose order can be exchanged to achieve coupled-mode attention on coupled-mode tokens based on 'channel-temporal block', 'order-channel-body joint', 'channel-hyper-edge (any order)' and 'channel-only' pairs. The first module, called Multi-order Pooling (MP), additionally learns weighted aggregation along the hyper-edge mode, whereas the second module, Temporal block Pooling (TP), aggregates along the temporal block mode. Our end-to-end trainable network yields state-of-the-art results compared to GCN-, transformer- and hypergraph-based counterparts.
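To make the notions of hyper-edge order and temporal blocks concrete, here is a minimal Python sketch. It is not the authors' implementation: it assumes that the hyper-edges of order r are all unordered groups of r distinct joints, and that temporal blocks are fixed-length, possibly overlapping frame windows. The helper names `hyper_edges` and `temporal_blocks`, and the block length and stride values, are hypothetical choices for illustration. The joint count of 25 matches the NTU RGB+D skeleton format.

```python
from itertools import combinations


def hyper_edges(num_joints, order):
    """Enumerate hyper-edges of a given order (assumption: every
    unordered group of `order` distinct joints is a hyper-edge)."""
    return list(combinations(range(num_joints), order))


def temporal_blocks(num_frames, block_len, stride):
    """Return (start, end) frame ranges, end exclusive, for fixed-length
    blocks (assumption: blocks may overlap when stride < block_len)."""
    last_start = num_frames - block_len
    return [(s, s + block_len) for s in range(0, last_start + 1, stride)]


J = 25  # joints per skeleton in NTU RGB+D

# Order 1 = single joints, order 2 = joint pairs, order 3 = joint triplets.
counts = {r: len(hyper_edges(J, r)) for r in (1, 2, 3)}

# Hypothetical 30-frame sequence, blocks of 10 frames, stride 5.
blocks = temporal_blocks(num_frames=30, block_len=10, stride=5)

print(counts)
print(blocks)
```

The number of hyper-edges grows combinatorially with the order (25 joints, 300 pairs, 2300 triplets under the assumption above), which illustrates why embeddings of different orders have different sizes along the hyper-edge mode and need an aggregation step such as MP before they can be combined.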
Related benchmarks
| Task | Dataset | Accuracy (%) | Rank |
|---|---|---|---|
| Action Recognition | NTU RGB+D 120 (X-set) | 93.8 | 661 |
| Action Recognition | NTU RGB+D 60 (Cross-View) | 98.7 | 575 |
| Action Recognition | NTU RGB+D 60 (X-sub) | 94.8 | 467 |
| Action Recognition | NTU RGB+D X-sub 120 | 92 | 377 |
| Action Recognition | NTU RGB-D Cross-Subject 60 | 94.8 | 305 |
| Skeleton-based Action Recognition | NTU 60 (X-sub) | 94.8 | 220 |
| Action Recognition | NTU RGB+D 120 Cross-Subject | -- | 183 |
| Action Recognition | NTU RGB+D X-View 60 | 98.7 | 172 |
| Skeleton-based Action Recognition | NTU 120 (X-sub) | 92 | 139 |
| Skeleton-based Action Recognition | NTU-RGB+D 120 (Cross-setup) | 93.8 | 136 |