Exploiting temporal context for 3D human pose estimation in the wild
About
We present a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos. Unlike previous algorithms that operate on single frames, we show that reconstructing a person over an entire sequence gives extra constraints that can resolve ambiguities. This is because videos often give multiple views of a person, yet the overall body shape does not change and 3D positions vary slowly. Our method improves not only on standard mocap-based datasets like Human3.6M, where we show quantitative improvements, but also on challenging in-the-wild datasets such as Kinetics. Building upon our algorithm, we present a new dataset of more than 3 million frames of YouTube videos from Kinetics with automatically generated 3D poses and meshes. We show that retraining a single-frame 3D pose estimator on this data improves accuracy on both real-world and mocap data by evaluating on the 3DPW and HumanEva datasets.
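To illustrate why "3D positions vary slowly" acts as a useful constraint, here is a minimal toy sketch. It is not the authors' bundle-adjustment objective: it only shows how per-frame estimates can be refined jointly over a sequence by penalizing frame-to-frame change. The function name `smooth_trajectory` and the weight `lam` are hypothetical, chosen for illustration.

```python
import numpy as np


def smooth_trajectory(noisy, lam=10.0):
    """Jointly refine per-frame estimates over a whole sequence.

    Solves  min_x  sum_t ||x_t - y_t||^2 + lam * sum_t ||x_{t+1} - x_t||^2
    where y is `noisy`, an array of shape (T, D) holding one estimate per frame.
    This is a toy stand-in for a temporal-consistency term, not the paper's method.
    """
    noisy = np.asarray(noisy, dtype=float)
    num_frames = noisy.shape[0]
    # First-difference operator: row t is e_{t+1} - e_t, shape (T-1, T).
    diff = np.diff(np.eye(num_frames), axis=0)
    # Normal equations of the quadratic objective: (I + lam * D^T D) x = y.
    system = np.eye(num_frames) + lam * diff.T @ diff
    return np.linalg.solve(system, noisy)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, 50)
    truth = np.stack([np.sin(t), np.cos(t), t], axis=1)      # slowly varying 3D path
    observed = truth + 0.05 * rng.normal(size=truth.shape)   # noisy per-frame estimates
    refined = smooth_trajectory(observed, lam=10.0)
    print("mean error before:", np.linalg.norm(observed - truth, axis=1).mean())
    print("mean error after: ", np.linalg.norm(refined - truth, axis=1).mean())
```

A larger `lam` trusts the slow-motion assumption more and the per-frame estimates less; a sequence that does not move at all is returned unchanged.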
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Human Pose Estimation | Human3.6M (test) | MPJPE (Average) | 77.8 | 547 |
| 3D Human Pose Estimation | 3DPW (test) | PA-MPJPE | 72.2 | 505 |
| 3D Human Pose Estimation | Human3.6M (Protocol 2) | -- | -- | 315 |
| 3D Human Mesh Recovery | 3DPW (test) | -- | -- | 264 |
| 3D Human Pose Estimation | Human3.6M | -- | -- | 160 |
| 3D Human Pose and Shape Estimation | 3DPW (test) | MPJPE-PA | 72.2 | 158 |
| 3D Human Pose Estimation | Human3.6M Protocol #2 (test) | Average Error | 54.3 | 140 |
| Human Mesh Recovery | 3DPW | PA-MPJPE | 72.2 | 123 |
| 3D Human Mesh Recovery | Human3.6M (test) | -- | -- | 120 |
| 3D Human Pose and Shape Estimation | Human3.6M (test) | PA-MPJPE | 54.3 | 119 |
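The table reports MPJPE and PA-MPJPE. As a rough guide to how these metrics are commonly computed, the sketch below implements mean per-joint position error and its Procrustes-aligned variant with NumPy. It is an assumption-laden illustration, not the evaluation code behind the numbers above: the function names are hypothetical, and joint sets, root alignment, and units follow each benchmark's own protocol.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints.

    pred, gt: arrays of shape (J, 3), in the same units.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred, gt):
    """Similarity-align pred to gt (scale, rotation, translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the 3x3 cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    correction = np.diag([1.0, 1.0, d])
    rot = vt.T @ correction @ u.T
    # Least-squares optimal isotropic scale.
    scale = (s * np.diag(correction)).sum() / (p ** 2).sum()
    return scale * p @ rot.T + mu_g


def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(17, 3))  # 17 synthetic joints
    c, s = np.cos(0.7), np.sin(0.7)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # A prediction that differs from gt only by a similarity transform.
    pred = 1.5 * gt @ rot_z.T + np.array([0.1, -0.2, 0.3])
    print("MPJPE:   ", mpjpe(pred, gt))
    print("PA-MPJPE:", pa_mpjpe(pred, gt))
```

Because the alignment removes global scale, rotation, and translation, PA-MPJPE isolates errors in the articulated pose itself, which is why it is typically lower than plain MPJPE for the same prediction.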