Look Ma, no markers: holistic performance capture without the hassle
About
We tackle the problem of highly accurate, holistic performance capture for the face, body and hands simultaneously. Motion-capture technologies used in film and game production typically capture the face, body or hands independently, and require complex and expensive hardware and a high degree of manual intervention from skilled operators. While machine-learning-based approaches exist to overcome these problems, they usually only support a single camera, often operate on a single part of the body, do not produce precise world-space results, and rarely generalize outside specific contexts. In this work, we introduce the first technique for marker-free, high-quality reconstruction of the complete human body, including eyes and tongue, without requiring any calibration, manual intervention or custom hardware. Our approach produces stable world-space results from arbitrary camera rigs and supports varied capture environments and clothing. We achieve this through a hybrid approach that leverages machine learning models trained exclusively on synthetic data and powerful parametric models of human shape and motion. We evaluate our method on a number of body, face and hand reconstruction benchmarks and demonstrate state-of-the-art results that generalize across diverse datasets.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Human Mesh Recovery | MoYo | MPJPE | 60.15 | 16 |
| 3D Human Pose Estimation | CHI3D | MPJPE | 46.47 | 15 |
| Human Mesh Recovery | RICH | -- | -- | 13 |
| Human Pose Estimation | Harmony4D | PVE | 45.6 | 9 |
| Hand Pose Estimation | FreiHAND (test) | PA-MPVPE | 8.1 | 7 |
| 3D human mesh fitting | MammaEval-S | MPJPE | 25.97 | 5 |
| 3D human mesh fitting | MammaEval-D | MPJPE | 27.98 | 5 |
| 3D human reconstruction | Harmony4D + CHI3D + MammaEval-D (test) | Mean Perceptual Depth (mm) | 13.73 | 5 |
| 2D Landmark Prediction | Harmony4D IoU > 0.5 | Mean 2D Euclidean Distance Error (pixels) | 31.45 | 4 |
| 2D Landmark Prediction | RICH | Mean 2D Euclidean Distance Error (pixels) | 13.26 | 4 |
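
Several of the metrics above (MPJPE, PVE) are mean per-point Euclidean errors: the average distance between each predicted 3D joint or mesh vertex and its ground-truth counterpart. The sketch below illustrates that general definition only. The function name `mean_per_point_error` and the toy coordinates are hypothetical, and the exact evaluation protocol of each benchmark (joint set, root alignment, units) is not stated here, so it is not reproduced. The PA- variants additionally apply a rigid Procrustes alignment before measuring the error, which this sketch omits.

```python
import math
from typing import Sequence

Point3D = Sequence[float]


def mean_per_point_error(pred: Sequence[Point3D], gt: Sequence[Point3D]) -> float:
    """Mean Euclidean distance between corresponding predicted and ground-truth points.

    With joints as input this is the usual MPJPE definition; with mesh
    vertices it is a per-vertex error (PVE). No alignment is applied.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(pred, gt))
    return total / len(pred)


# Hypothetical toy data (three joints, coordinates assumed to be in millimetres).
gt = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (100.0, 0.0, 12.0), (0.0, 200.0, 0.0)]

mpjpe = mean_per_point_error(pred, gt)
print(f"MPJPE: {mpjpe:.2f} mm")  # per-joint errors are 5, 12 and 0, so the mean is 5.67
```

Because the error is averaged over points, a single badly placed joint can dominate the score on small joint sets, which is one reason benchmarks often report aligned and unaligned variants side by side.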