Multi3DRefer: Grounding Text Description to Multiple 3D Objects

About

We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a strict setting is unnatural as localizing potentially multiple objects is a common need in real-world scenarios and robotic tasks (e.g., visual navigation and object rearrangement). To address this setting we propose Multi3DRefer, generalizing the ScanRefer dataset and task. Our dataset contains 61926 descriptions of 11609 objects, where zero, single or multiple target objects are referenced by each description. We also introduce a new evaluation metric and benchmark methods from prior work to enable further investigation of multi-modal 3D scene understanding. Furthermore, we develop a better baseline leveraging 2D features from CLIP by rendering object proposals online with contrastive learning, which outperforms the state of the art on the ScanRefer benchmark.

Yiming Zhang, ZeMing Gong, Angel X. Chang• 2023

Related benchmarks

Task	Dataset	Result
3D Visual Grounding	ScanRefer (val)	Overall Accuracy @ IoU 0.5044.7	253
3D Visual Grounding	ScanRefer	Acc@0.544.7	142
3D Dense Captioning	Scan2Cap	CIDEr @0.538.4	106
3D Visual Grounding	Nr3D	Overall Success Rate49.4	97
3D Visual Grounding	ScanRefer v1 (test)	--	96
3D Visual Grounding	Nr3D (test)	Overall Success Rate49.4	88
Referring 3D Instance Segmentation	ScanRefer (val)	mIoU35.7	37
Visual Grounding	ScanRefer v1 (val)	Acc@0.5 (Unique)77.2	35
Multi-object 3D Visual Grounding	Multi3DRefer	F1@0.2542.8	30
3D Visual Grounding	Multi3DRefer (val)	F1@0.5038.4	29

Showing 10 of 23 rows

Other info

Follow for update

@wizwand_team Discord