
Open Vocabulary Monocular 3D Object Detection

About

We propose and study open-vocabulary monocular 3D detection, a novel task that aims to detect objects of any category in metric 3D space from a single RGB image. Existing 3D object detectors either rely on costly sensors such as LiDAR or multi-view setups, or remain confined to closed-vocabulary settings with limited categories, restricting their applicability. We identify two key challenges in this new setting. First, the scarcity of 3D bounding box annotations limits the ability to train generalizable models. To reduce dependence on 3D supervision, we propose a framework that effectively integrates pretrained 2D and 3D vision foundation models. Second, missing labels and semantic ambiguities (e.g., table vs. desk) in existing datasets hinder reliable evaluation. To address this, we design a novel metric that captures model performance while mitigating annotation issues. Our approach achieves state-of-the-art results in zero-shot 3D detection of novel categories as well as in-domain detection on seen classes. We hope our method provides a strong baseline and our evaluation protocol establishes a reliable benchmark for future research.

Jin Yao, Hao Gu, Xuweiyi Chen, Jiayun Wang, Zezhou Cheng • 2024

Related benchmarks

Task | Dataset | Result | Rank
3D Object Detection | nuScenes | -- | 19
3D Object Detection | OMNI3D | AP (3D) 22.98 | 14
3D Object Detection | Omni3D Full Unified (test) | AP 3D (Overall) 22.98 | 7
3D Object Detection | Omni3D HYPERSIM domain | AP15 58.87 | 6
3D Object Detection | Hypersim | IoU3D 8.99 | 6
3D Object Detection | Omni3D NUSCENES domain | AP15 25.45 | 6
3D Object Detection | Omni3D KITTI domain | AP15 7.75 | 6
3D Object Detection | KITTI | IoU3D 38.12 | 6
3D Object Detection | SUN RGB-D | IoU3D 24.19 | 6
3D Object Detection | Omni3D ARKITSCENES domain | AP@0.15 15.2 | 6
Showing 10 of 14 rows
