UniCon3R: Unified Contact-aware 4D Human-Scene Reconstruction from Monocular Video
About
We introduce UniCon3R, a unified feed-forward framework for online human-scene 4D reconstruction from monocular video. Current feed-forward human-scene reconstruction methods suffer from artifacts such as bodies floating above the ground or penetrating parts of the scene. A key reason is the lack of effective interaction modelling between the human and the environment. Our goal is to exploit contact between the human and the scene during inference to actively improve the human mesh reconstruction. To that end, we explicitly model interaction by inferring 4D contact from the human pose and scene geometry, and we use that contact as a corrective cue for generating the pose. This enables UniCon3R to jointly recover scene geometry and spatially aligned 4D humans within the scene. Experiments on standard human-centric video benchmarks show that UniCon3R outperforms state-of-the-art baselines on physical plausibility and global human motion estimation while preserving fast feed-forward inference. The results validate our central claim: contact serves as a powerful internal prior, establishing a new paradigm for physically grounded joint human-scene reconstruction. The project page is available at https://surtantheta.github.io/UniCon3R
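To make the idea of contact as a corrective cue concrete, the following is a minimal geometric sketch, not UniCon3R's learned feed-forward model. It assumes a per-vertex contact mask is already available (a stand-in for a predicted contact signal) and rigidly translates the body so that contacting vertices land on their nearest scene points, which removes floating or penetration in this toy setting. The function name `correct_translation` and all inputs are hypothetical.

```python
import numpy as np


def correct_translation(vertices, contact_mask, scene_points):
    """Rigidly shift a body so its contact vertices touch the scene.

    vertices:     (V, 3) body vertices in world coordinates.
    contact_mask: (V,) boolean array, True where contact is assumed.
                  Hypothetical input; a real system would predict it.
    scene_points: (S, 3) points sampled from the scene surface.
    """
    contact_vertices = vertices[contact_mask]
    if len(contact_vertices) == 0:
        # No contact cue available, so leave the body unchanged.
        return vertices.copy()

    # Distance from every contact vertex to every scene point: (C, S).
    diffs = contact_vertices[:, None, :] - scene_points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)

    # Nearest scene point for each contact vertex.
    nearest = scene_points[dists.argmin(axis=1)]

    # Average displacement that brings contact vertices onto the scene.
    offset = (nearest - contact_vertices).mean(axis=0)
    return vertices + offset
```

For example, a body whose feet hover 10 cm above a flat ground plane would be shifted down by roughly 10 cm. A single rigid translation is the simplest possible correction; it cannot fix pose errors, which is one reason a learned model that conditions pose generation on contact is more expressive than this sketch.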
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Global human motion estimation | RICH (test) | WA-MPJPE | 81.5 | 6 |
| Global human motion estimation | EMDB-2 (test) | WA-MPJPE | 113.7 | 6 |
| Binary SMPL-vertex contact prediction | RICH (test) | Precision | 64 | 5 |
| Physical grounding | RICH 11 (test) | Collision Score | 7.71 | 5 |