OCH3R: Object-Centric Holistic 3D Reconstruction

About

Object-centric scene understanding is a fundamental challenge in computer vision. Existing approaches often rely on multi-stage pipelines that first apply pre-trained segmentors to extract individual objects, followed by per-object 3D reconstruction. Such methods are computationally expensive, fragile to segmentation errors, and scale poorly with scene complexity. We introduce OCH3R, a unified framework for Object-Centric Holistic 3D Reconstruction from a single RGB image. OCH3R performs one forward pass to simultaneously predict all object instances with their 6D poses and detailed 3D reconstructions. The key idea is a transformer architecture that predicts per-pixel attributes, including CLIP-based category embeddings, metric depth, normalized object coordinates (NOCS), and a fixed number of 3D Gaussians representing each object. To supervise these Gaussian reconstructions, we transform them into canonical space using the predicted 6D poses and align them with pre-rendered canonical ground truth, avoiding costly per-image Gaussian label generation. On standard indoor benchmarks, OCH3R achieves state-of-the-art performance across monocular depth estimation, open-vocabulary semantic segmentation, and RGB-only category-level 6D pose estimation, while producing high-fidelity, editable per-object reconstructions. Crucially, inference is fully feed-forward and scales independently of the number of objects, offering orders-of-magnitude speedups over conventional multi-stage pipelines in cluttered scenes.
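
The canonical-space supervision described above can be illustrated with a minimal sketch. It assumes a similarity pose (rotation R, translation t, uniform scale s) that maps canonical to camera coordinates as x_cam = s * R * x_can + t. This is a common convention in NOCS-style pose estimation, but the abstract does not confirm it. The function name `to_canonical` and all sample values are hypothetical, and only Gaussian centers are handled; covariances and other Gaussian attributes are omitted.

```python
import math


def to_canonical(points_cam, R, t, s=1.0):
    """Map camera-frame points into an object's canonical frame.

    Assumes x_cam = s * R @ x_can + t, hence x_can = R^T @ (x_cam - t) / s.
    R is a 3x3 rotation matrix as nested lists (row-major); t is a 3-vector.
    """
    out = []
    for p in points_cam:
        d = [p[i] - t[i] for i in range(3)]
        # Multiply by R^T: component j is the dot product of column j of R with d.
        out.append([sum(R[i][j] * d[i] for i in range(3)) / s for j in range(3)])
    return out


# Hypothetical pose: 90-degree rotation about z, a translation, uniform scale 2.
c, sn = math.cos(math.pi / 2), math.sin(math.pi / 2)
R = [[c, -sn, 0.0], [sn, c, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 2.0, 3.0]
scale = 2.0

# A hypothetical Gaussian center in canonical space, pushed into the camera frame.
canonical = [[0.5, 0.0, 0.0]]
camera = [
    [scale * sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]
    for p in canonical
]

# Applying the inverse pose should recover the canonical center.
recovered = to_canonical(camera, R, t, scale)
print(recovered)
```

Once predictions are expressed in the canonical frame, they can be compared against a single pre-rendered canonical target per object, rather than against labels generated for every image.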

Yi Du, Yang You, Xiang Wan, Leonidas Guibas • 2026

Related benchmarks

Task	Dataset	Result
Semantic segmentation	YCB-V	mIoU 9.14	23
Depth Estimation	YCB-Video	Overall δ1 Accuracy 69.96	16
Monocular Depth Estimation	NOCS (real)	Delta-1 Accuracy 98.43	16
Monocular Depth Estimation	HOPE	Delta-1 Accuracy 61.36	16
Monocular Metric Depth Estimation	PACE	Delta 1 Accuracy 94.82	8
Semantic segmentation	PACE	mIoU 7.34	8
Semantic segmentation	HOPE	mIoU 6.9	8
Semantic segmentation	NOCS (real)	mIoU 13.4	8
Semantic segmentation	OMNI	mIoU 18.43	8
Monocular Metric Depth Estimation	OMNI	Delta-1 Accuracy 42.77	8
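
For readers unfamiliar with the two metrics in the table, the sketch below uses their standard definitions: delta-1 accuracy is the fraction of valid pixels whose predicted-to-true depth ratio (taken in whichever direction is larger) stays below 1.25, and mIoU is the per-class intersection-over-union averaged over classes. Whether these benchmarks use exactly this masking, threshold, and class handling is an assumption. The function names and sample values are hypothetical, and the functions return fractions, whereas the table values appear to be percentages.

```python
def delta1_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels with valid ground truth where max(pred/gt, gt/pred) < threshold."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]  # skip pixels without depth labels
    if not valid:
        return 0.0
    hits = sum(1 for p, g in valid if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(valid)


def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in the prediction or the labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:  # classes absent from both are left out of the mean
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical flattened depth maps in meters; a ground-truth value of 0.0 marks a missing label.
d1 = delta1_accuracy(pred=[1.0, 2.0, 4.0, 1.0], gt=[1.1, 2.0, 3.0, 0.0])

# Hypothetical flattened label maps with two classes.
miou = mean_iou(pred=[0, 0, 1, 1], gt=[0, 1, 1, 1], num_classes=2)

print(f"delta-1: {d1:.4f}, mIoU: {miou:.4f}")
```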

