Grounded 3D-LLM with Referent Tokens

About

Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling it to handle sequences that interleave 3D and textual data. Per-task instruction-following templates are employed to ensure naturalness and diversity when translating 3D vision tasks into language formats. To facilitate the use of referent tokens in subsequent language modeling, we provide a large-scale, automatically curated grounded scene-text dataset with over 1 million phrase-to-region correspondences and introduce Contrastive Language-Scene Pre-training (CLASP) to perform phrase-level scene-text alignment using this data. Our comprehensive evaluation covers open-ended tasks such as dense captioning and 3D question answering, alongside closed-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks demonstrate the leading performance and broad applicability of Grounded 3D-LLM. Code and datasets are available at https://groundedscenellm.github.io/grounded_3d-llm.github.io.

Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Runsen Xu, Ruiyuan Lyu, Dahua Lin, Jiangmiao Pang • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| 3D Question Answering | ScanQA (val) | -- | 217 |
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.50: 44.1 | 192 |
| 3D Visual Grounding | ScanRefer | Acc@0.5: 44.1 | 142 |
| 3D Dense Captioning | Scan2Cap | CIDEr@0.5: 70.6 | 96 |
| 3D Dense Captioning | Scan2Cap (val) | B-4: 0.355 | 43 |
| 3D Question Answering | ScanQA | -- | 38 |
| Visual Grounding | ScanRefer v1 (val) | -- | 30 |
| 3D Question Answering | ScanQA v1.0 (test) | ROUGE: 72.7 | 26 |
| Multi-object 3D Visual Grounding | Multi3DRefer | F1@0.25: 45.2 | 24 |
| 3D Visual Grounding | ScanRefer (test) | -- | 21 |
Showing 10 of 14 rows
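
Several of the grounding results above are reported as accuracy at an IoU threshold (for example Acc@0.5). As a rough illustration of what such a metric measures, the sketch below computes the intersection-over-union of axis-aligned 3D boxes and the percentage of predictions that meet a threshold. It is not the benchmarks' official evaluation code: the function names `box_iou_3d` and `accuracy_at_iou`, the `(xmin, ymin, zmin, xmax, ymax, zmax)` box layout, and the use of `>=` at the threshold are assumptions made for this example.

```python
from typing import Sequence, Tuple

# Assumed box layout for this sketch: (xmin, ymin, zmin, xmax, ymax, zmax).
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            # No overlap along this axis, so the boxes do not intersect.
            return 0.0
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union


def accuracy_at_iou(
    predictions: Sequence[Box],
    ground_truths: Sequence[Box],
    threshold: float = 0.5,
) -> float:
    """Percentage of predictions whose IoU with the paired ground truth
    is at least `threshold` (one predicted box per query is assumed)."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have equal length")
    if not predictions:
        return 0.0
    hits = sum(
        box_iou_3d(p, g) >= threshold for p, g in zip(predictions, ground_truths)
    )
    return 100.0 * hits / len(predictions)


if __name__ == "__main__":
    target = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
    shifted = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)  # overlaps half of `target`
    # One exact match and one prediction with IoU 1/3: one hit out of two.
    print(accuracy_at_iou([target, shifted], [target, target], threshold=0.5))
```

Under this reading, a score of 44.1 at IoU 0.5 means that about 44 percent of the referred objects were localized with a box overlapping the ground truth by at least half.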
