
Grounded 3D-LLM with Referent Tokens

About

Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling it to handle sequences that interleave 3D and textual data. Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats. To facilitate the use of referent tokens in subsequent language modeling, we provide a large-scale, automatically curated grounded scene-text dataset with over 1 million phrase-to-region correspondences and introduce Contrastive Language-Scene Pre-training (CLASP) to perform phrase-level scene-text alignment using this data. Our comprehensive evaluation covers open-ended tasks like dense captioning and 3D question answering, alongside close-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks reveal the leading performance and the broad applicability of Grounded 3D-LLM. Code and datasets are available at https://groundedscenellm.github.io/grounded_3d-llm.github.io.

Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Runsen Xu, Ruiyuan Lyu, Dahua Lin, Jiangmiao Pang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| 3D Question Answering | ScanQA (val) | CIDEr | 72.7 | 290 |
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.50 | 44.1 | 253 |
| 3D Visual Grounding | ScanRefer | Acc@0.5 | 44.1 | 142 |
| 3D Dense Captioning | Scan2Cap | CIDEr @0.5 | 70.6 | 106 |
| 3D Visual Grounding | Nr3D | Overall Success Rate | 32.8 | 97 |
| 3D Question Answering | ScanQA | -- | -- | 48 |
| 3D Dense Captioning | Scan2Cap (val) | B-4 | 0.355 | 43 |
| Visual Grounding | ScanRefer v1 (val) | -- | -- | 35 |
| Multi-object 3D Visual Grounding | Multi3DRefer | F1@0.25 | 45.2 | 30 |
| 3D Visual Grounding | Multi3DRefer (val) | F1@0.50 | 40.8 | 29 |
Showing 10 of 17 rows
