UniCon3R: Unified Contact-aware 4D Human-Scene Reconstruction from Monocular Video

About

We introduce UniCon3R, a unified feed-forward framework for online human-scene 4D reconstruction from monocular video. Current feed-forward human-scene reconstruction methods suffer from physically implausible artifacts, such as bodies that float above the ground or penetrate parts of the scene. A key reason is the lack of effective interaction modelling between the human and the environment. Our goal is to exploit contact between the human and the scene during inference to actively improve the human mesh reconstruction. To that end, we explicitly model interaction by inferring 4D contact from the human pose and scene geometry, and we use this contact as a corrective cue when generating the pose. This enables UniCon3R to jointly recover scene geometry and spatially aligned 4D humans within the scene. Experiments on standard human-centric video benchmarks show that UniCon3R outperforms state-of-the-art baselines on physical plausibility and global human motion estimation while preserving fast, feed-forward inference speeds. The results validate our central claim: contact serves as a powerful internal prior, thus establishing a new paradigm for physically grounded joint human-scene reconstruction. The project page is available at https://surtantheta.github.io/UniCon3R
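
To make the floating and penetration artifacts concrete, here is a minimal toy sketch in Python. It is not the UniCon3R method, which infers contact with a learned feed-forward model. It assumes a flat ground plane at a known height, a z-up coordinate frame, and a hypothetical helper name `ground_contact_report`. It only illustrates how a contact signal can flag an implausible body placement and suggest a correction.

```python
import numpy as np

def ground_contact_report(vertices, ground_z=0.0, tol=0.02):
    """Toy check of a body mesh against a flat ground plane (assumed z-up, metres).

    vertices : (V, 3) array of mesh vertex positions.
    ground_z : assumed height of the ground plane.
    tol      : distance below which a vertex counts as "in contact".
    """
    vertices = np.asarray(vertices, dtype=float)
    heights = vertices[:, 2] - ground_z      # signed height of each vertex
    lowest = heights.min()                   # lowest point of the body

    contact_mask = np.abs(heights) <= tol    # per-vertex binary contact

    # Illustrative correction: translate the body so its lowest point
    # rests on the plane when it is floating or penetrating.
    corrected = vertices.copy()
    if abs(lowest) > tol:
        corrected[:, 2] -= lowest

    return {
        "contact_mask": contact_mask,
        "floating": bool(lowest > tol),
        "penetrating": bool(lowest < -tol),
        "corrected": corrected,
    }
```

A body whose lowest vertex sits 10 cm above the plane is reported as floating, and the corrected copy has its lowest vertex on the plane. A real scene is not a single plane, so this is only a simplified stand-in for contact between the body and arbitrary scene geometry.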

Tanuj Sur, Shashank Tripathi, Nikos Athanasiou, Ha Linh Nguyen, Kai Xu, Michael J. Black, Angela Yao • 2026

Related benchmarks

Task	Dataset	Result
Global human motion estimation	RICH (test)	WA-MPJPE 81.5	15
Global human motion estimation	EMDB-2 (test)	WA-MPJPE 113.7	6
Binary SMPL-vertex contact prediction	RICH (test)	Precision 64	5
Physical grounding	RICH 11 (test)	Collision Score 7.71	5
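
For readers unfamiliar with the WA-MPJPE metric in the table, the following Python sketch shows how a world-aligned mean per-joint position error is commonly computed: the whole predicted joint trajectory is aligned to the ground truth with a single similarity transform (Umeyama / Procrustes alignment), and the mean joint distance is then measured. The function names, the array layout `(frames, joints, 3)`, and the choice to align over the full sequence are assumptions for illustration. The exact evaluation protocol behind the numbers above (segment length, units, joint set) is not stated here.

```python
import numpy as np

def similarity_align(pred, gt):
    """Umeyama alignment: find s, R, t minimising ||s * R @ pred + t - gt||.

    pred, gt : (N, 3) arrays of corresponding 3D points.
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g

    # Cross-covariance between target (gt) and source (pred) points.
    cov = g.T @ p / len(pred)
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation.
    d = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0

    R = U @ np.diag(d) @ Vt
    var_p = (p ** 2).sum() / len(pred)
    s = (S * d).sum() / var_p
    t = mu_g - s * R @ mu_p
    return s, R, t

def wa_mpjpe(pred_joints, gt_joints):
    """World-aligned MPJPE over a sequence of shape (T, J, 3)."""
    pred = np.asarray(pred_joints, dtype=float).reshape(-1, 3)
    gt = np.asarray(gt_joints, dtype=float).reshape(-1, 3)
    s, R, t = similarity_align(pred, gt)
    aligned = s * pred @ R.T + t
    # Mean Euclidean distance over all joints and frames.
    return float(np.linalg.norm(aligned - gt, axis=1).mean())
```

Because the alignment removes any global rotation, scale, and translation, a prediction that differs from the ground truth only by such a transform scores approximately zero. The metric therefore reflects errors in the shape of the global trajectory rather than in the choice of world frame.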
