ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language

About

We introduce the task of 3D object localization in RGB-D scans using natural language descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free-form description of a specified target object. To address this task, we propose ScanRefer, learning a fused descriptor from 3D object proposals and encoded sentence embeddings. This fused descriptor correlates language expressions with geometric features, enabling regression of the 3D bounding box of a target object. We also introduce the ScanRefer dataset, containing 51,583 descriptions of 11,046 objects from 800 ScanNet scenes. ScanRefer is the first large-scale effort to perform object localization via natural language expression directly in 3D.

Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner • 2019

Related benchmarks

Task	Dataset	Result
3D Question Answering	ScanQA (val)	CIDEr: 64.9	290
3D Visual Grounding	ScanRefer (val)	Overall Accuracy @ IoU 0.50: 43.31	253
3D Visual Grounding	ScanRefer	Acc@0.5: 26.1	142
3D Visual Grounding	Nr3D	Overall Success Rate: 34.2	97
3D Visual Grounding	ScanRefer v1 (test)	--	96
3D Visual Grounding	Nr3D (test)	Overall Success Rate: 34.2	88
3D Question Answering	ScanQA w/ objects (test)	EM@1: 20.56	55
3D Question Answering	ScanQA w/o objects (test)	EM@1: 19.04	51
3D Question Answering	ScanQA	EM (Exact Match): 17.3	48
3D Visual Grounding	ScanRefer Overall	Acc @ 0.25: 39	41

Showing 10 of 33 rows
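
Several rows above report accuracy at an IoU threshold (for example Acc@0.5 or Acc @ 0.25). The page does not define the metric, so the following is only a minimal sketch of how such a score is commonly computed, assuming axis-aligned boxes stored as (xmin, ymin, zmin, xmax, ymax, zmax). The function names `box_iou_3d` and `accuracy_at_iou` are hypothetical and do not come from the ScanRefer code.

```python
from typing import Sequence

# Assumed box layout (not taken from the source page):
# (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Sequence[float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            # No overlap along this axis, so the boxes do not intersect.
            return 0.0
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union


def accuracy_at_iou(
    predictions: Sequence[Box], targets: Sequence[Box], threshold: float
) -> float:
    """Fraction of predictions whose IoU with the paired target meets the threshold."""
    if not predictions:
        return 0.0
    hits = sum(
        box_iou_3d(pred, target) >= threshold
        for pred, target in zip(predictions, targets)
    )
    return hits / len(predictions)


if __name__ == "__main__":
    target = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
    shifted = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)  # overlaps half of the target
    print(accuracy_at_iou([target, shifted], [target, target], 0.5))
    print(accuracy_at_iou([target, shifted], [target, target], 0.25))
```

Under these assumptions, a prediction counts as correct only when its box overlaps the ground-truth box by at least the threshold, which is why a score at 0.5 can never exceed the score at 0.25 for the same predictions.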
