MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

About

Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.

Haitian Li, Haozhe Xie, Junxiang Xu, Beichen Wen, Fangzhou Hong, Ziwei Liu • 2026

Related benchmarks

Task	Dataset	Result
3D Generation	PhysX-Bench 1.0 (test)	CLIP Alignment Score: 83.5	5
Simulation-ready 3D Generation	PhysXVerse	PSNR: 19.68	5
Simulation-ready 3D Generation	PhysX-Mobility	PSNR: 16.46	5
Geometry Reconstruction	PartNet-Mobility All 46 classes	CD: 125	4
Kinematic Prediction	PartNet-Mobility All 46 classes	Type Accuracy: 67.47	4
Geometry Reconstruction	PartNet-Mobility Partial 7 classes	CD (x10^-2): 0.77	3
Kinematic Prediction	PartNet-Mobility Partial 7 classes	Type Accuracy: 88.26	3

