DiffPose: Multi-hypothesis Human Pose Estimation using Diffusion models
About
Traditionally, monocular 3D human pose estimation employs a machine learning model to predict the most likely 3D pose for a given input image. However, a single image can be highly ambiguous and admits multiple plausible solutions for the 2D-3D lifting step, so a predictor that outputs a single pose is overly confident. To address this, we propose DiffPose, a conditional diffusion model that predicts multiple hypotheses for a given input image. In comparison to similar approaches, our diffusion model is straightforward and avoids intensive hyperparameter tuning, complex network structures, mode collapse, and unstable training. Moreover, we tackle a problem of the common two-step approach, which first estimates a distribution of 2D joint locations via joint-wise heatmaps and subsequently approximates these heatmaps by their first- or second-moment statistics. Since such a simplification of the heatmaps removes valid information about possibly correct, though labeled unlikely, joint locations, we propose to represent the heatmaps as a set of 2D joint candidate samples. To extract information about the original distribution from these samples, we introduce our embedding transformer, which conditions the diffusion model. Experimentally, we show that DiffPose slightly improves upon the state of the art for multi-hypothesis pose estimation for simple poses and outperforms it by a large margin for highly ambiguous poses.
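The difference between a moment-based summary of a heatmap and a set of candidate samples can be illustrated with a small sketch. This is not the authors' implementation: the function names `sample_joint_candidates` and `heatmap_mean`, the array layout `(joints, height, width)`, and the toy heatmap are all assumptions made for illustration. The sketch treats each heatmap as a categorical distribution over pixels and draws candidate locations from it.

```python
import numpy as np


def sample_joint_candidates(heatmaps, num_samples, rng):
    """Draw 2D candidate locations per joint from heatmaps of shape (J, H, W).

    Hypothetical helper: assumes every heatmap has some positive mass.
    Returns an integer array of shape (J, num_samples, 2) holding (x, y) pixels.
    """
    num_joints, height, width = heatmaps.shape
    flat = np.clip(heatmaps.reshape(num_joints, -1).astype(np.float64), 0.0, None)
    probs = flat / flat.sum(axis=1, keepdims=True)  # one categorical per joint
    candidates = np.empty((num_joints, num_samples, 2), dtype=np.int64)
    for j in range(num_joints):
        idx = rng.choice(height * width, size=num_samples, p=probs[j])
        ys, xs = np.unravel_index(idx, (height, width))
        candidates[j, :, 0] = xs
        candidates[j, :, 1] = ys
    return candidates


def heatmap_mean(heatmap):
    """First-moment summary: the expected (x, y) location of one heatmap."""
    height, width = heatmap.shape
    probs = heatmap / heatmap.sum()
    ys, xs = np.mgrid[0:height, 0:width]
    return np.array([(probs * xs).sum(), (probs * ys).sum()])


# Toy heatmap for one joint with two separated modes (an ambiguous detection).
heatmaps = np.zeros((1, 8, 8))
heatmaps[0, 4, 1] = 0.6  # likely location at x=1, y=4
heatmaps[0, 4, 6] = 0.4  # less likely but plausible location at x=6, y=4

rng = np.random.default_rng(0)
candidates = sample_joint_candidates(heatmaps, num_samples=200, rng=rng)
mean_xy = heatmap_mean(heatmaps[0])

print("mean location:", mean_xy)
print("distinct sampled x positions:", np.unique(candidates[0, :, 0]))
```

In this toy case the mean falls between the two modes, at a pixel the heatmap assigns no mass to, while the sampled candidates stay on the two plausible locations. How such candidates are embedded and passed to the diffusion model is specific to the paper and is not shown here.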
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 3D Human Pose Estimation | MPI-INF-3DHP (test) | PCK | 84.6 | 606 |
| 3D Human Pose Estimation | Human3.6M (test) | MPJPE (Average) | 32 | 570 |
| 3D Human Pose Estimation | Human3.6M (Protocol #1) | MPJPE (Avg.) | 43.3 | 457 |
| 3D Human Pose Estimation | Human3.6M Standard Protocol | MPJPE | 44.2 | 19 |
| 3D Human Pose Estimation | Human3.6M H36MA | MPJPE | 63.1 | 17 |
| 3D Human Pose Estimation | Human 3.6M Subjects 9 & 11 (test) | MPJPE | 43.3 | 16 |