DiffPose: Multi-hypothesis Human Pose Estimation using Diffusion models

About

Traditionally, monocular 3D human pose estimation employs a machine learning model to predict the single most likely 3D pose for a given input image. However, a single image can be highly ambiguous and admits multiple plausible solutions for the 2D-3D lifting step, so a predictor that outputs only one pose is overly confident. To address this, we propose DiffPose, a conditional diffusion model that predicts multiple hypotheses for a given input image. Compared with similar approaches, our diffusion model is straightforward and avoids intensive hyperparameter tuning, complex network structures, mode collapse, and unstable training. Moreover, we tackle a problem of the common two-step approach, which first estimates a distribution of 2D joint locations via joint-wise heatmaps and subsequently approximates each heatmap by its first- or second-moment statistics. Since such a simplification removes valid information about possibly correct, though labeled unlikely, joint locations, we propose to represent the heatmaps as a set of 2D joint candidate samples. To extract information about the original distribution from these samples, we introduce our embedding transformer, which conditions the diffusion model. Experimentally, we show that DiffPose slightly improves upon the state of the art for multi-hypothesis pose estimation for simple poses and outperforms it by a large margin for highly ambiguous poses.
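The difference between summarizing a heatmap by its moments and representing it as a set of candidate samples can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy 5x5 heatmap, and the choice of sampling pixels with probability proportional to their heatmap value are all assumptions made for illustration.

```python
import random


def pixel_coords(heatmap):
    """List every (x, y) pixel coordinate of a 2D heatmap given as nested lists."""
    return [(x, y) for y in range(len(heatmap)) for x in range(len(heatmap[0]))]


def argmax_joint(heatmap):
    """Single most likely pixel: keeps one mode and discards all others."""
    return max(pixel_coords(heatmap), key=lambda c: heatmap[c[1]][c[0]])


def mean_joint(heatmap):
    """First-moment summary: the heatmap-weighted mean location."""
    coords = pixel_coords(heatmap)
    total = sum(heatmap[y][x] for x, y in coords)
    mean_x = sum(x * heatmap[y][x] for x, y in coords) / total
    mean_y = sum(y * heatmap[y][x] for x, y in coords) / total
    return (mean_x, mean_y)


def sample_joint_candidates(heatmap, num_samples, rng=None):
    """Draw candidate pixels with probability proportional to heatmap value.

    Assumption: proportional sampling is one plausible way to turn a heatmap
    into a sample set; the paper's exact sampling scheme may differ.
    """
    rng = rng or random.Random()
    coords = pixel_coords(heatmap)
    weights = [max(heatmap[y][x], 0.0) for x, y in coords]
    if sum(weights) <= 0:
        raise ValueError("heatmap has no positive mass")
    return rng.choices(coords, weights=weights, k=num_samples)


# Toy ambiguous heatmap with two separated modes (hypothetical values).
heatmap = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]

candidates = sample_joint_candidates(heatmap, num_samples=200, rng=random.Random(0))
print("argmax:", argmax_joint(heatmap))
print("mean:", mean_joint(heatmap))
print("distinct sampled locations:", sorted(set(candidates)))
```

In this toy case the argmax keeps only the stronger mode, the mean lands between the two modes where the heatmap is zero, and the sample set retains both locations, including the one labeled less likely.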

Karl Holmquist, Bastian Wandt • 2022

Related benchmarks

Task	Dataset	Metric	Result	
3D Human Pose Estimation	MPI-INF-3DHP (test)	PCK	84.6	606
3D Human Pose Estimation	Human3.6M (test)	MPJPE (Average)	32	570
3D Human Pose Estimation	Human3.6M (Protocol #1)	MPJPE (Avg.)	43.3	457
3D Human Pose Estimation	Human3.6M Standard Protocol	MPJPE	44.2	19
3D Human Pose Estimation	Human3.6M H36MA	MPJPE	63.1	17
3D Human Pose Estimation	Human 3.6M Subjects 9 & 11 (test)	MPJPE	43.3	16
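For readers unfamiliar with the metrics in the table, the following sketch shows how MPJPE (mean per-joint position error) and PCK (percentage of correct keypoints) are commonly computed. The function names, the toy poses, and the threshold are illustrative assumptions. Scoring a set of hypotheses by the best one is a common convention for multi-hypothesis methods, but this page does not state which protocol produced the numbers above.

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance between predicted and ground-truth 3D joints."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(target)


def pck(pred, target, threshold):
    """Percentage of joints whose error is within `threshold` (same units as the poses)."""
    hits = sum(math.dist(p, t) <= threshold for p, t in zip(pred, target))
    return 100.0 * hits / len(target)


def best_hypothesis_mpjpe(hypotheses, target):
    """Assumed multi-hypothesis protocol: score the hypothesis closest to the ground truth."""
    return min(mpjpe(h, target) for h in hypotheses)


# Toy two-joint poses (hypothetical values).
target = [(0.0, 0.0, 0.0), (0.0, 0.0, 100.0)]
hyp_a = [(0.0, 0.0, 0.0), (0.0, 0.0, 130.0)]
hyp_b = [(30.0, 40.0, 0.0), (0.0, 0.0, 100.0)]

print("MPJPE of hyp_a:", mpjpe(hyp_a, target))
print("MPJPE of hyp_b:", mpjpe(hyp_b, target))
print("best-of-2 MPJPE:", best_hypothesis_mpjpe([hyp_a, hyp_b], target))
print("PCK of hyp_a at threshold 20:", pck(hyp_a, target, threshold=20.0))
```

Lower MPJPE is better, while higher PCK is better, which is why the two kinds of rows in the table are not directly comparable.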
