DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies
About
Precise human mesh recovery (HMR) from multi-view images remains challenging: end-to-end methods produce entangled errors that are hard to localize, while fitting-based methods rely on sparse keypoints that provide only limited surface constraints. We observe that the true bottleneck lies in the quality of the intermediate representations, and that dense pixel-to-surface correspondences can be generated effectively by repurposing pre-trained diffusion models with rich visual priors. We propose DiffProxy, a Stable-Diffusion-based framework trained on large-scale synthetic data with pixel-perfect annotations. A multi-conditional proxy generator predicts dense correspondences from multi-view images, providing uniform surface constraints that enable precise fitting. Hand refinement feeds enlarged hand crops alongside the full-body images to capture fine-grained detail, while test-time scaling exploits the stochasticity of diffusion sampling to estimate per-pixel uncertainty. Trained only on synthetic data, DiffProxy achieves state-of-the-art results on five diverse real-world benchmarks. Project page: https://wrk226.github.io/DiffProxy.html
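The test-time scaling idea in the abstract (sampling the generator several times and reading the spread as per-pixel uncertainty) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `aggregate_samples`, the `(K, H, W, C)` array layout, and the `1 / (1 + uncertainty)` weighting are assumptions made only for illustration.

```python
import numpy as np


def aggregate_samples(samples: np.ndarray):
    """Summarize K stochastic proxy predictions per pixel.

    samples: array of shape (K, H, W, C), one dense proxy map per sampling run
             (hypothetical layout; the abstract does not specify one).
    Returns:
        mean:        (H, W, C) per-pixel average prediction
        uncertainty: (H, W) per-pixel spread across the K samples
        weights:     (H, W) confidence in (0, 1], lower where samples disagree
    """
    if samples.ndim != 4 or samples.shape[0] < 2:
        raise ValueError("expected at least two samples with shape (K, H, W, C)")

    mean = samples.mean(axis=0)
    # Sum the per-channel variance, then take the square root to get one
    # spread value per pixel.
    uncertainty = np.sqrt(samples.var(axis=0).sum(axis=-1))
    # Assumed down-weighting rule: pixels where samples agree get weight ~1.
    weights = 1.0 / (1.0 + uncertainty)
    return mean, uncertainty, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.random((4, 4, 3))
    # Stand-in for repeated diffusion sampling: the same map plus noise.
    samples = base[None] + rng.normal(scale=0.1, size=(8, 4, 4, 3))
    mean, uncertainty, weights = aggregate_samples(samples)
    print(mean.shape, uncertainty.shape, weights.shape)
```

In a fitting stage, weights of this kind could scale each pixel's correspondence residual so that pixels on which the samples disagree contribute less; how DiffProxy actually uses its uncertainty estimate is not described in the abstract.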
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Human Mesh Recovery | MPI-INF-3DHP | MPJPE | 42 | 35 |
| Human Mesh Recovery | MoYo | MPJPE | 29.1 | 16 |
| Human Mesh Recovery | RICH | PA-MPVPE | 27.6 | 13 |
| Human Mesh Recovery | BEHAVE | PA-MPJPE | 22.7 | 7 |
| Human Mesh Recovery | 4D-DRESS | PA-MPJPE | 17.3 | 7 |
| Human Mesh Recovery | 4D-DRESS partial | PA-MPJPE | 22.7 | 7 |
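For readers unfamiliar with the metrics in the table, the sketch below shows how MPJPE and its Procrustes-aligned variant (PA-MPJPE) are commonly computed. It follows the usual definitions and is not taken from the paper: the function names are illustrative, and the evaluation protocol behind the table (joint set, root alignment, units) is not stated here. PA-MPVPE applies the same computation to mesh vertices instead of joints.

```python
import numpy as np


def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and ground-truth points, shape (N, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Align pred to gt with the best similarity transform (scale, rotation, translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g

    # SVD of the 3x3 cross-covariance gives the optimal rotation (Kabsch/Umeyama).
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    correction = np.array([1.0, 1.0, d])
    rotation = vt.T @ np.diag(correction) @ u.T
    scale = (s * correction).sum() / (p ** 2).sum()

    return scale * p @ rotation.T + mu_g


def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.random((17, 3))
    # A prediction that differs from gt only by a similarity transform.
    angle = 0.3
    rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                      [np.sin(angle), np.cos(angle), 0.0],
                      [0.0, 0.0, 1.0]])
    pred = 2.0 * gt @ rot_z.T + np.array([0.5, -0.2, 1.0])
    print("MPJPE:", mpjpe(pred, gt))
    print("PA-MPJPE:", pa_mpjpe(pred, gt))
```

Because the alignment removes global scale, rotation, and translation, the PA variants measure only the error in articulated pose and shape, which is why they are typically lower than the unaligned metrics.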