SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

About

3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io.

Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.50 | 48.1 | 192 |
| 3D Question Answering | SQA3D (test) | EM@1 | 49.9 | 98 |
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate | 64.9 | 88 |
| 3D Visual Grounding | Nr3D | Overall Success Rate | 64.9 | 83 |
| 3D Visual Grounding | Sr3D (test) | Overall Accuracy | 77.5 | 73 |
| 3D Question Answering | ScanQA w/ objects (test) | EM@1 | 25 | 55 |
| 3D Question Answering | ScanQA w/o objects (test) | EM@1 | 23.5 | 51 |
| 3D Question Answering | Beacon3D | Case Score | 40.3 | 23 |
| 3D Visual Grounding | Sr3D | Overall Accuracy | 77.5 | 15 |
| 3D Visual Grounding | Nr3D without GT object class | Easy Success | 72.5 | 13 |
Showing 10 of 24 rows
