MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction
About
Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Geometry Reconstruction | PartNet-Mobility (all 46 classes) | CD | 125 | 4 |
| Kinematic Prediction | PartNet-Mobility (all 46 classes) | Type Accuracy | 67.47 | 4 |
| Geometry Reconstruction | PartNet-Mobility (partial, 7 classes) | CD (x10^-2) | 0.77 | 3 |
| Kinematic Prediction | PartNet-Mobility (partial, 7 classes) | Type Accuracy | 88.26 | 3 |
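
For readers unfamiliar with the two metrics in the table, the sketch below shows one common way to compute them. It is an illustration only: the page does not state which Chamfer distance (CD) variant or scaling is used, so the squared, mean-reduced, two-directional form here is an assumption, as is reading Type Accuracy as the percentage of parts whose predicted joint type matches the ground truth. The function names `chamfer_distance` and `type_accuracy` are hypothetical and do not come from the MonoArt code.

```python
def _sq_dist(p, q):
    # Squared Euclidean distance between two points given as coordinate tuples.
    return sum((x - y) ** 2 for x, y in zip(p, q))


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed variant: for each direction, average the squared distance from
    every point to its nearest neighbour in the other set, then add the
    two directions. Brute force, O(len(a) * len(b)).
    """
    def one_way(src, dst):
        return sum(min(_sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def type_accuracy(predicted, ground_truth):
    """Percentage of parts whose predicted joint type equals the ground truth.

    Assumes both lists are aligned part by part and non-empty.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("prediction and ground-truth lists must be aligned")
    matches = sum(1 for p, g in zip(predicted, ground_truth) if p == g)
    return 100.0 * matches / len(ground_truth)


if __name__ == "__main__":
    pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
    print(chamfer_distance(pred_points, gt_points))

    pred_types = ["revolute", "prismatic", "fixed", "revolute"]
    gt_types = ["revolute", "prismatic", "revolute", "revolute"]
    print(type_accuracy(pred_types, gt_types))
```

Lower CD indicates closer geometric agreement, while higher Type Accuracy indicates better joint-type prediction. Real evaluations typically sample thousands of surface points and use a spatial index or GPU nearest-neighbour search rather than this brute-force loop.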