Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos
About
Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address these challenges, we develop a transformer-based framework that exploits temporal information for robust estimation. Because hand pose estimation and action recognition operate at different temporal granularities yet are semantically correlated, we build a network hierarchy with two cascaded transformer encoders: the first exploits short-term temporal cues for hand pose estimation, and the second aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, FPHA and H2O. Extensive ablation studies verify our design choices.
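The two-level temporal hierarchy described above can be illustrated with a minimal sketch. This is not the authors' implementation: the learned transformer encoders are replaced by a parameter-free single-head self-attention stand-in, and the function names (`pose_block`, `action_block`), the window size of 4, and the feature dimensions are all hypothetical choices made for illustration. The sketch only shows the data flow: short windows for per-frame pose refinement, then attention over the whole clip on fused pose and object tokens.

```python
import math
import random


def self_attention(tokens):
    """Parameter-free single-head self-attention (identity Q/K/V projections).

    Stand-in for a learned transformer encoder; assumption for illustration only.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        # Output is the softmax-weighted average of all tokens.
        out.append([sum(wi * v[j] for wi, v in zip(w, tokens)) / z for j in range(d)])
    return out


def pose_block(frame_feats, window=4):
    """Short-term level: attend only within consecutive windows of `window` frames."""
    refined = []
    for start in range(0, len(frame_feats), window):
        refined.extend(self_attention(frame_feats[start:start + window]))
    return refined


def action_block(pose_tokens, object_tokens):
    """Long-term level: fuse per-frame pose and object tokens, attend over the
    full clip, then mean-pool into one clip-level feature."""
    fused = [p + o for p, o in zip(pose_tokens, object_tokens)]  # feature concatenation
    attended = self_attention(fused)
    n = len(attended)
    return [sum(tok[j] for tok in attended) / n for j in range(len(attended[0]))]


random.seed(0)
T, D, D_OBJ = 10, 8, 3  # hypothetical clip length and feature sizes
frame_feats = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
object_feats = [[random.gauss(0, 1) for _ in range(D_OBJ)] for _ in range(T)]

pose_tokens = pose_block(frame_feats, window=4)      # one refined token per frame
clip_feat = action_block(pose_tokens, object_feats)  # one feature for the whole clip
print(len(pose_tokens), len(clip_feat))
```

In this sketch, a frame influences pose tokens only inside its own short window, while the clip-level feature depends on every frame. A real system would feed `pose_tokens` to a pose regression head and `clip_feat` to an action classifier, both omitted here.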
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | H2O (test) | Accuracy | 86.36 | 26 |
| 3D Hand Pose Estimation | H2O | MPJPE Right | 35.63 | 14 |
| 3D Hand Pose Estimation | H2O (test) | MEPE (Camera Space) | 35.02 | 8 |
| 3D Hand Pose Estimation | H2O (same-domain) | MPJPE | 35.33 | 8 |
| Action Recognition | FPHA (test) | Accuracy | 0.9409 | 6 |
| 2D Hand Pose Estimation | H2O (test) | PCK@0.2 | 84.75 | 6 |
| Action Recognition | FPHA (standard) | Accuracy | 94.09 | 5 |
| Hand Pose Estimation | FPHA (test) | -- | -- | 3 |