SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

About

3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io.

Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang • 2024

Related benchmarks

Task	Dataset	Result
3D Visual Grounding	ScanRefer (val)	Overall Accuracy @ IoU 0.50: 48.1	262
3D Question Answering	SQA3D (test)	EM@1: 49.9	197
3D Visual Grounding	Nr3D	Overall Success Rate: 64.9	109
3D Visual Grounding	Nr3D (test)	Overall Success Rate: 64.9	88
3D Visual Grounding	Sr3D (test)	Overall Accuracy: 77.5	73
3D Question Answering	ScanQA w/ objects (test)	EM@1: 25	55
3D Question Answering	ScanQA w/o objects (test)	EM@1: 23.5	51
3D Visual Question Answering	SQA3D	EM@1: 49.9	30
3D Visual Grounding	ScanRefer (test)	--	29
3D Visual Grounding	Sr3D	Overall Accuracy: 77.5	27

Showing 10 of 37 rows
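
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how "Accuracy @ IoU 0.50" and "EM@1" are commonly computed. It is not the evaluation code of SceneVerse or of any benchmark listed here. The function names, the axis-aligned box layout `(xmin, ymin, zmin, xmax, ymax, zmax)`, the `>=` threshold comparison, and the lowercase/strip answer normalization are all assumptions made for illustration; official benchmark scripts may differ in these details.

```python
from typing import Sequence

# Assumed box layout (hypothetical): (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Sequence[float]


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box],
                    threshold: float = 0.5) -> float:
    """Fraction of predicted boxes whose IoU with the ground truth
    reaches the threshold (assumed comparison: >=)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


def exact_match_at_1(pred_answers: Sequence[str],
                     reference_answers: Sequence[Sequence[str]]) -> float:
    """Fraction of questions whose top-1 answer exactly matches any
    reference answer (assumed normalization: strip + lowercase)."""
    if len(pred_answers) != len(reference_answers):
        raise ValueError("inputs must have the same length")
    if not pred_answers:
        return 0.0

    def norm(s: str) -> str:
        return s.strip().lower()

    hits = sum(norm(p) in {norm(r) for r in refs}
               for p, refs in zip(pred_answers, reference_answers))
    return hits / len(pred_answers)
```

Both functions return a fraction in [0, 1]; the table above reports such values as percentages.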
