From Sparse to Dense: Spatio-Temporal Fusion for Multi-View 3D Human Pose Estimation with DenseWarper
About
In multi-view 3D human pose estimation, models typically rely on images captured simultaneously from different camera views to predict the pose at a specific moment. While this traditional approach provides accurate spatial information, it often overlooks the rich temporal dependencies between adjacent frames. To address this, we propose a novel input scheme for 3D human pose estimation: sparse interleaved input. This scheme uses images captured from different camera views at different time points (e.g., View 1 at time $t$ and View 2 at time $t+\delta$), allowing our model to capture rich spatio-temporal information and effectively boost performance. More importantly, it offers two key advantages. First, it can theoretically increase the output pose frame rate by a factor of N with N cameras, breaking through single-view frame rate limitations and enhancing the temporal resolution of the output. Second, by using only a sparse subset of the available frames, it reduces data redundancy while achieving better performance. We introduce the DenseWarper model, which leverages epipolar geometry for efficient spatio-temporal heatmap exchange. We conducted extensive experiments on the Human3.6M and MPI-INF-3DHP datasets. The results demonstrate that our method, using only sparse interleaved images as input, outperforms traditional dense multi-view input approaches and achieves state-of-the-art performance. The source code for this work is available at: https://github.com/lingli1724/DenseWarper-ICLR2026
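The frame-rate claim can be illustrated with a small timing sketch. This is not the authors' code: the function names `interleaved_schedule` and `effective_fps` are hypothetical, and the sketch assumes the cameras are staggered uniformly (camera k delayed by k/(N·fps) seconds), whereas the abstract only states that views are captured at times such as $t$ and $t+\delta$. The camera count and frame rate used below are illustrative values.

```python
from fractions import Fraction


def interleaved_schedule(num_cameras, fps, num_cycles):
    """Return (timestamp, camera_index) pairs for uniformly staggered cameras.

    Assumption: every camera runs at the same `fps`, and camera k starts
    k / (num_cameras * fps) seconds after camera 0.
    """
    period = Fraction(1, fps)        # time between two frames of one camera
    offset = period / num_cameras    # stagger between adjacent cameras
    schedule = []
    for cycle in range(num_cycles):
        for cam in range(num_cameras):
            schedule.append((cycle * period + cam * offset, cam))
    return schedule


def effective_fps(schedule):
    """Frame rate of the merged stream, assuming one pose per captured image."""
    gaps = {later[0] - earlier[0] for earlier, later in zip(schedule, schedule[1:])}
    if len(gaps) != 1:
        raise ValueError("timestamps are not evenly spaced")
    return 1 / gaps.pop()


# Illustrative values: 4 cameras, each recording at 50 frames per second.
schedule = interleaved_schedule(num_cameras=4, fps=50, num_cycles=3)
print(effective_fps(schedule))
```

Under these assumptions the merged stream has one image every 1/(N·fps) seconds, which is where the theoretical N-fold increase in output pose rate comes from. Exact rational arithmetic (`Fraction`) is used so that the even-spacing check is not affected by floating-point rounding.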
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 3D Human Pose Estimation | Human3.6M (test) | MPJPE (Average) | 19.4 | 570 |
| 3D Human Pose Estimation | MPI-INF-3DHP | MPJPE | 65.89 | 122 |
| 3D Human Pose Estimation | Human3.6M 2D SimpleBaseline (test) | MPJPE Error (Direction) | 21.2 | 11 |
| 3D Human Pose Estimation | Human3.6M 2D Ground Truth (test) | Dir. | 23.2 | 11 |
| 3D Human Pose Estimation | Human3.6M 2D CPN (test) | Average Performance Score | 33.6 | 9 |
| 3D Human Pose Estimation | Human3.6M | Efficiency per MB | 0.291 | 8 |