Open Vocabulary Monocular 3D Object Detection

About

We propose and study open-vocabulary monocular 3D detection, a novel task that aims to detect objects of any category in metric 3D space from a single RGB image. Existing 3D object detectors either rely on costly sensors such as LiDAR or multi-view setups, or remain confined to closed-vocabulary settings with limited categories, restricting their applicability. We identify two key challenges in this new setting. First, the scarcity of 3D bounding box annotations limits the ability to train generalizable models. To reduce dependence on 3D supervision, we propose a framework that effectively integrates pretrained 2D and 3D vision foundation models. Second, missing labels and semantic ambiguities (e.g., table vs. desk) in existing datasets hinder reliable evaluation. To address this, we design a novel metric that captures model performance while mitigating annotation issues. Our approach achieves state-of-the-art results in zero-shot 3D detection of novel categories as well as in-domain detection on seen classes. We hope our method provides a strong baseline and our evaluation protocol establishes a reliable benchmark for future research.

Jin Yao, Hao Gu, Xuweiyi Chen, Jiayun Wang, Zezhou Cheng • 2024

Related benchmarks

Task	Dataset	Metric	Value
3D Object Detection	nuScenes	--	--	41
3D Object Detection	OMNI3D	AP (3D)	22.98	14
3D Object Detection	KITTI car class (val)	AP3D (IoU=0.5, Easy)	51.23	11
Monocular 3D Object Detection (Car)	KITTI-360 29 (test)	AP3D (IoU=0.3, Easy)	41.82	8
3D Object Detection	Waymo Vehicle class (val)	AP3D (All) @ IoU 0.5	5.46	8
3D Object Detection	Omni3D Full Unified (test)	AP 3D (Overall)	22.98	7
3D Object Detection	Omni3D HYPERSIM domain	AP15	58.87	6
3D Object Detection	Hypersim	IoU3D	8.99	6
3D Object Detection	Omni3D NUSCENES domain	AP15	25.45	6
3D Object Detection	Omni3D KITTI domain	AP15	7.75	6

Showing 10 of 18 rows
