3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

About

3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence. Current 3D-VL models rely heavily on sophisticated modules, auxiliary losses, and optimization tricks, which calls for a simple and unified model. In this paper, we propose 3D-VisTA, a pre-trained Transformer for 3D Vision and Text Alignment that can be easily adapted to various downstream tasks. 3D-VisTA simply utilizes self-attention layers for both single-modal modeling and multi-modal fusion without any sophisticated task-specific design. To further enhance its performance on 3D-VL tasks, we construct ScanScribe, the first large-scale 3D scene-text pairs dataset for 3D-VL pre-training. ScanScribe contains 2,995 RGB-D scans for 1,185 unique indoor scenes originating from ScanNet and 3R-Scan datasets, along with paired 278K scene descriptions generated from existing 3D-VL tasks, templates, and GPT-3. 3D-VisTA is pre-trained on ScanScribe via masked language/object modeling and scene-text matching. It achieves state-of-the-art results on various 3D-VL tasks, ranging from visual grounding and dense captioning to question answering and situated reasoning. Moreover, 3D-VisTA demonstrates superior data efficiency, obtaining strong performance even with limited annotations during downstream task fine-tuning.

Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, Qing Li • 2023
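
The abstract names masked language/object modeling as a pre-training objective but gives no details. As a rough illustration of the masking step only, here is a minimal sketch in plain Python. The `[MASK]` placeholder, the `mask_tokens` helper, and the 15% default ratio are assumptions made for this example and are not taken from the paper.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not from the paper


def mask_tokens(tokens, mask_ratio=0.15, seed=None):
    """Hide a random subset of tokens.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token a model would be asked to recover.
    The 0.15 default ratio is an assumption for illustration.
    """
    rng = random.Random(seed)
    # Mask at least one token when the sequence is non-empty.
    n_mask = max(1, round(len(tokens) * mask_ratio)) if tokens else 0
    positions = sorted(rng.sample(range(len(tokens)), n_mask))

    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]
        masked[i] = MASK
    return masked, targets


if __name__ == "__main__":
    words = "the chair is next to the table".split()
    masked, targets = mask_tokens(words, mask_ratio=0.3, seed=0)
    print(masked)
    print(targets)
```

The same routine could in principle be applied to a list of object identifiers instead of words. How 3D-VisTA actually masks objects and computes the loss is not stated on this page.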

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.50 | 45.8 | 155 |
| 3D Question Answering | ScanQA (val) | CIDEr | 76.6 | 133 |
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate | 64.2 | 88 |
| 3D Visual Grounding | Nr3D | Overall Success Rate | 64.2 | 74 |
| 3D Visual Grounding | Sr3D (test) | Overall Accuracy | 76.4 | 73 |
| 3D Question Answering | ScanQA w/ objects (test) | EM@1 | 27 | 55 |
| 3D Question Answering | SQA3D (test) | EM@1 | 48.5 | 55 |
| 3D Question Answering | ScanQA w/o objects (test) | EM@1 | 23 | 51 |
| 3D Situated Question Answering | SQA3D (test) | Average Accuracy | 48.5 | 40 |
| 3D Dense Captioning | Scan2Cap (val) | CIDEr (@0.5) | 66.9 | 33 |
