Grounded 3D-LLM with Referent Tokens
About
Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling it to handle sequences that interleave 3D and textual data. Per-task instruction-following templates are employed to ensure naturalness and diversity when translating 3D vision tasks into language formats. To facilitate the use of referent tokens in subsequent language modeling, we provide a large-scale, automatically curated grounded scene-text dataset with over 1 million phrase-to-region correspondences and introduce Contrastive Language-Scene Pre-training (CLASP) to perform phrase-level scene-text alignment using this data. Our comprehensive evaluation covers open-ended tasks such as dense captioning and 3D question answering, alongside closed-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks demonstrate the leading performance and broad applicability of Grounded 3D-LLM. Code and datasets are available at https://groundedscenellm.github.io/grounded_3d-llm.github.io.
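To make the idea of phrase-level scene-text alignment more concrete, the following is a minimal sketch of an InfoNCE-style contrastive loss between phrase embeddings and region embeddings. It is an illustration only, not the CLASP objective itself: the abstract does not specify the loss, so the function names, the temperature value, and the simplification of one matched region per phrase are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def phrase_region_contrastive_loss(phrase_embs, region_embs, positives, temperature=0.07):
    """Hypothetical InfoNCE-style loss over phrase-to-region pairs.

    positives[i] is the index of the region matched to phrase i
    (assumption: exactly one positive region per phrase).
    """
    total = 0.0
    for i, phrase in enumerate(phrase_embs):
        # Similarity of this phrase to every candidate region, scaled by temperature.
        logits = [cosine(phrase, region) / temperature for region in region_embs]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the matched region.
        total += log_denom - logits[positives[i]]
    return total / len(phrase_embs)


# Toy example: two phrases and two regions with 2-D embeddings.
phrases = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = phrase_region_contrastive_loss(phrases, regions, positives=[0, 1])
misaligned_loss = phrase_region_contrastive_loss(phrases, regions, positives=[1, 0])
```

In this toy setup the loss is close to zero when each phrase is paired with the region it points at, and large when the pairing is swapped, which is the behavior a phrase-to-region alignment objective is meant to reward.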
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.50: 44.1 | 155 |
| 3D Question Answering | ScanQA (val) | CIDEr: 72.7 | 133 |
| 3D Dense Captioning | Scan2Cap (val) | CIDEr (@0.5): 0.706 | 33 |
| Visual Grounding | ScanRefer v1 (val) | -- | 30 |
| 3D Question Answering | ScanQA v1.0 (test) | ROUGE: 72.7 | 26 |
| 3D Dense Captioning | Scan2Cap | BLEU-4 @0.5: 35.5 | 23 |
| 3D Visual Grounding | ScanRefer | Acc@0.25: 47.9 | 23 |
| 3D Question Answering | ScanQA | C Score: 72.7 | 16 |
| 3D Visual Grounding | Multi3DRefer (val) | F1@0.50: 40.6 | 14 |
| Multi-object 3D Visual Grounding | Multi3DRefer | F1@0.25: 45.2 | 8 |
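
Several of the grounding results above are reported at an IoU threshold (for example Acc@0.25 and accuracy at IoU 0.50). As a rough illustration of what such a metric measures, here is a sketch that scores predicted 3D boxes against ground-truth boxes. It assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)` and counts a prediction as correct when its IoU is at or above the threshold. The box format, the function names, and the `>=` convention are assumptions and are not taken from any benchmark's official evaluation code.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            # No overlap along this axis means no overlap at all.
            return 0.0
        intersection *= high - low
    volume_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    volume_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return intersection / (volume_a + volume_b - intersection)


def accuracy_at_iou(predictions, ground_truths, threshold):
    """Fraction of predictions whose IoU with the paired ground truth meets the threshold."""
    hits = sum(
        1
        for pred, gt in zip(predictions, ground_truths)
        if box_iou_3d(pred, gt) >= threshold
    )
    return hits / len(ground_truths)


# Toy example: one prediction shifted by half the box width, one exact match.
gt_box = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
shifted = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)  # IoU with gt_box is 4 / 12 = 1/3
preds = [shifted, gt_box]
gts = [gt_box, gt_box]
acc_025 = accuracy_at_iou(preds, gts, threshold=0.25)
acc_050 = accuracy_at_iou(preds, gts, threshold=0.50)
```

The shifted prediction passes the looser 0.25 threshold but fails at 0.50, which is why the same model typically reports a lower score at the stricter threshold.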