
Grounded 3D-LLM with Referent Tokens

About

Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling it to handle sequences that interleave 3D and textual data. Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats. To facilitate the use of referent tokens in subsequent language modeling, we provide a large-scale, automatically curated grounded scene-text dataset with over 1 million phrase-to-region correspondences and introduce Contrastive Language-Scene Pre-training (CLASP) to perform phrase-level scene-text alignment using this data. Our comprehensive evaluation covers open-ended tasks like dense captioning and 3D question answering, alongside close-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks reveal the leading performance and the broad applicability of Grounded 3D-LLM. Code and datasets are available at https://groundedscenellm.github.io/grounded_3d-llm.github.io.
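
To make the idea of phrase-level contrastive alignment concrete, here is a minimal sketch of an InfoNCE-style loss that pulls each phrase embedding toward its paired region embedding and pushes it away from the other regions. This is a generic contrastive objective, not the authors' CLASP implementation: the function names, the toy 2D embeddings, the one-directional (phrase-to-region) form, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(phrases, regions, temperature=0.07):
    """Mean cross-entropy of matching phrase i to region i among all regions.

    Hypothetical helper: phrases[i] and regions[i] are assumed to be a
    positive pair; every other region acts as a negative for phrase i.
    """
    total = 0.0
    for i, phrase in enumerate(phrases):
        logits = [cosine(phrase, region) / temperature for region in regions]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(phrases)


# Toy embeddings: correctly paired phrases and regions vs. swapped pairs.
phrase_embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned_regions = [[1.0, 0.0], [0.0, 1.0]]
swapped_regions = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce(phrase_embeddings, aligned_regions)
swapped_loss = info_nce(phrase_embeddings, swapped_regions)
```

In this toy setup the loss is close to zero when each phrase is paired with its own region and much larger when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to reward.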

Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Runsen Xu, Ruiyuan Lyu, Dahua Lin, Jiangmiao Pang • 2024

Related benchmarks

Task | Dataset | Result | Rank
3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.50: 44.1 | 155
3D Question Answering | ScanQA (val) | CIDEr: 72.7 | 133
3D Dense Captioning | Scan2Cap (val) | CIDEr (@0.5): 0.706 | 33
Visual Grounding | ScanRefer v1 (val) | -- | 30
3D Question Answering | ScanQA v1.0 (test) | ROUGE: 72.7 | 26
3D Dense Captioning | Scan2Cap | BLEU-4 @0.5: 35.5 | 23
3D Visual Grounding | ScanRefer | Acc@0.25: 47.9 | 23
3D Question Answering | ScanQA | C Score: 72.7 | 16
3D Visual Grounding | Multi3DRefer (val) | F1@0.50: 40.6 | 14
Multi-object 3D Visual Grounding | Multi3DRefer | F1@0.25: 45.2 | 8
Showing 10 of 11 rows
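
Several of the grounding results above are reported as accuracy at an IoU threshold (for example 0.25 or 0.50). The sketch below shows one common way such a score can be computed: a prediction counts as correct when the 3D IoU between the predicted and ground-truth box meets the threshold. It assumes axis-aligned boxes encoded as (xmin, ymin, zmin, xmax, ymax, zmax) and one prediction per query; the function names and toy boxes are illustrative and are not taken from any benchmark's official evaluation code.

```python
def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            return 0.0  # No overlap along this axis, so no overlap at all.
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union


def accuracy_at_iou(predictions, ground_truths, threshold):
    """Fraction of queries whose predicted box reaches the IoU threshold."""
    hits = sum(
        box_iou_3d(pred, gt) >= threshold
        for pred, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)


# Toy example: one exact match and one box shifted by half its width.
reference_box = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
shifted_box = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)

predictions = [reference_box, reference_box]
ground_truths = [reference_box, shifted_box]

acc_at_025 = accuracy_at_iou(predictions, ground_truths, 0.25)
acc_at_050 = accuracy_at_iou(predictions, ground_truths, 0.50)
```

The shifted pair overlaps with an IoU of 1/3, so it counts as a hit at the 0.25 threshold but a miss at 0.50, which is why scores at the stricter threshold are never higher than those at the looser one.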
