OCH3R: Object-Centric Holistic 3D Reconstruction
About
Object-centric scene understanding is a fundamental challenge in computer vision. Existing approaches often rely on multi-stage pipelines that first apply pre-trained segmentors to extract individual objects, followed by per-object 3D reconstruction. Such methods are computationally expensive, fragile to segmentation errors, and scale poorly with scene complexity. We introduce OCH3R, a unified framework for Object-Centric Holistic 3D Reconstruction from a single RGB image. OCH3R performs one forward pass to simultaneously predict all object instances with their 6D poses and detailed 3D reconstructions. The key idea is a transformer architecture that predicts per-pixel attributes, including CLIP-based category embeddings, metric depth, normalized object coordinates (NOCS), and a fixed number of 3D Gaussians representing each object. To supervise these Gaussian reconstructions, we transform them into canonical space using the predicted 6D poses and align them with pre-rendered canonical ground truth, avoiding costly per-image Gaussian label generation. On standard indoor benchmarks, OCH3R achieves state-of-the-art performance across monocular depth estimation, open-vocabulary semantic segmentation, and RGB-only category-level 6D pose estimation, while producing high-fidelity, editable per-object reconstructions. Crucially, inference is fully feed-forward and scales independently of the number of objects, offering orders-of-magnitude speedups over conventional multi-stage pipelines in cluttered scenes.
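The canonical-space supervision step described above can be illustrated with a minimal sketch. It assumes a similarity-transform pose convention, `x_cam = s * R @ x_can + t`, which the abstract does not specify. The function name `to_canonical` and the parameters `R`, `t`, and `s` are hypothetical and not taken from the OCH3R code. Only Gaussian centers are handled here; covariances, opacities, and colors are left out.

```python
import numpy as np


def to_canonical(centers_cam, R, t, s=1.0):
    """Map Gaussian centers from camera space into canonical object space.

    Assumed (hypothetical) pose convention: x_cam = s * R @ x_can + t,
    so the inverse is x_can = R^T @ (x_cam - t) / s.
    """
    centers_cam = np.asarray(centers_cam, dtype=float)
    # For row-vector points, R^T @ v is written as v @ R.
    return (centers_cam - t) @ R / s


# Toy example: a 90-degree rotation about the z-axis, a translation, and a scale.
theta = np.pi / 2
R = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
t = np.array([0.1, -0.2, 1.5])
s = 0.5

canonical = np.array([[0.5, 0.0, 0.0], [0.0, 0.5, -0.5]])
camera = s * canonical @ R.T + t          # forward pose applied to row vectors
recovered = to_canonical(camera, R, t, s)  # should match `canonical`
```

Once predictions are in canonical space, they could be compared against a shared set of pre-rendered canonical targets, which is what allows the method to avoid generating Gaussian labels per image.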
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic segmentation | YCB-V | mIoU | 9.14 | 23 |
| Depth Estimation | YCB-Video | Overall δ1 Accuracy | 69.96 | 16 |
| Monocular Depth Estimation | NOCS (real) | Delta-1 Accuracy | 98.43 | 16 |
| Monocular Depth Estimation | HOPE | Delta-1 Accuracy | 61.36 | 16 |
| Monocular Metric Depth Estimation | PACE | Delta 1 Accuracy | 94.82 | 8 |
| Semantic segmentation | PACE | mIoU | 7.34 | 8 |
| Semantic segmentation | HOPE | mIoU | 6.9 | 8 |
| Semantic segmentation | NOCS (real) | mIoU | 13.4 | 8 |
| Semantic segmentation | OMNI | mIoU | 18.43 | 8 |
| Monocular Metric Depth Estimation | OMNI | Delta-1 Accuracy | 42.77 | 8 |
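
For readers unfamiliar with the two metrics in the table, the sketch below shows their commonly used definitions. Delta-1 accuracy is the fraction of valid pixels where `max(pred/gt, gt/pred) < 1.25`, and mIoU is the per-class intersection over union averaged across classes. The exact evaluation protocol behind the numbers above (valid-pixel masks, class lists, depth caps) is not stated here, so the function names and masking choices are assumptions.

```python
import numpy as np


def delta1_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) below the threshold."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > 0                              # assumption: gt <= 0 marks missing depth
    p = np.maximum(pred[valid], 1e-8)           # guard against division by zero
    g = gt[valid]
    ratio = np.maximum(p / g, g / p)
    return float(np.mean(ratio < threshold))


def mean_iou(pred_labels, gt_labels, num_classes):
    """Mean intersection over union, skipping classes absent from both maps."""
    pred_labels = np.asarray(pred_labels)
    gt_labels = np.asarray(gt_labels)
    ious = []
    for c in range(num_classes):
        p, g = pred_labels == c, gt_labels == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                            # class not present; excluded from the mean
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


# Toy inputs: two of three depth pixels fall within the 1.25 ratio.
d1 = delta1_accuracy([1.0, 2.0, 4.0], [1.0, 2.4, 2.0])
# Toy labels: class 0 IoU is 1/2, class 1 IoU is 2/3.
miou = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Both functions return fractions in [0, 1]; leaderboards often report them multiplied by 100.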