Grounded 3D-LLM with Referent Tokens

About

Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling it to handle sequences that interleave 3D and textual data. Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats. To facilitate the use of referent tokens in subsequent language modeling, we provide a large-scale, automatically curated grounded scene-text dataset with over 1 million phrase-to-region correspondences and introduce Contrastive Language-Scene Pre-training (CLASP) to perform phrase-level scene-text alignment using this data. Our comprehensive evaluation covers open-ended tasks like dense captioning and 3D question answering, alongside close-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks reveal the leading performance and the broad applicability of Grounded 3D-LLM. Code and datasets are available at https://groundedscenellm.github.io/grounded_3d-llm.github.io.
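As a rough illustration of what phrase-level scene-text alignment can look like, the sketch below computes a generic symmetric contrastive (InfoNCE-style) loss between phrase embeddings and region embeddings. It is not the CLASP objective from the paper, whose exact formulation is not given here. The function names, the one-to-one pairing of phrase i with region i, and the temperature value of 0.07 are all assumptions made for the example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def phrase_region_contrastive_loss(phrase_embs, region_embs, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical helper, not the paper's code).

    Assumes phrase i is the positive match for region i and every other
    region in the batch is a negative.
    """
    p = [normalize(v) for v in phrase_embs]
    r = [normalize(v) for v in region_embs]
    n = len(p)
    # Similarity matrix: rows index phrases, columns index regions.
    sim = [[dot(p[i], r[j]) / temperature for j in range(n)] for i in range(n)]
    # Phrase-to-region direction: classify the right region for each phrase.
    p2r = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Region-to-phrase direction: classify the right phrase for each region.
    r2p = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (p2r + r2p)


# Toy 2-D embeddings: matched pairs should give a much lower loss than swapped pairs.
aligned = phrase_region_contrastive_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
)
swapped = phrase_region_contrastive_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]
)
```

In this toy setup the loss is near zero when each phrase embedding points the same way as its region embedding, and large when the pairs are swapped. That gap is the signal a contrastive pre-training stage would optimize.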

Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Runsen Xu, Ruiyuan Lyu, Dahua Lin, Jiangmiao Pang • 2024

Related benchmarks

Task	Dataset	Metric	Result
3D Question Answering	ScanQA (val)	CIDEr	72.7	290
3D Visual Grounding	ScanRefer (val)	Overall Accuracy @ IoU 0.50	44.1	253
3D Visual Grounding	ScanRefer	Acc@0.5	44.1	142
3D Dense Captioning	Scan2Cap	CIDEr @0.5	70.6	106
3D Visual Grounding	Nr3D	Overall Success Rate	32.8	97
3D Question Answering	ScanQA	--	--	48
3D Dense Captioning	Scan2Cap (val)	B-4	0.355	43
Visual Grounding	ScanRefer v1 (val)	--	--	35
Multi-object 3D Visual Grounding	Multi3DRefer	F1@0.25	45.2	30
3D Visual Grounding	Multi3DRefer (val)	F1@0.50	40.8	29

Showing 10 of 17 rows
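
For readers unfamiliar with the grounding metrics in the table, accuracy at an IoU threshold (for example Acc@0.5) is commonly computed as the fraction of predictions whose 3D intersection-over-union with the ground-truth box meets the threshold. The following is a minimal sketch under the assumption of axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax). The helper names are hypothetical, and the actual evaluation scripts for each benchmark may differ in detail.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for axis in range(3):
        # Overlap length along this axis; non-positive means the boxes are disjoint.
        overlap = min(a[axis + 3], b[axis + 3]) - max(a[axis], b[axis])
        if overlap <= 0:
            return 0.0
        inter *= overlap

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)


def accuracy_at_iou(preds, gts, threshold=0.5):
    """Fraction of predicted boxes whose IoU with the paired ground truth >= threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if box_iou_3d(p, g) >= threshold)
    return hits / len(gts)


# Toy data: the first prediction matches exactly, the second overlaps only slightly.
preds = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1)]
gts = [(0, 0, 0, 2, 2, 2), (0.5, 0.5, 0.5, 1.5, 1.5, 1.5)]
acc = accuracy_at_iou(preds, gts, threshold=0.5)
```

In this example one of the two predictions clears the 0.5 threshold, so `acc` is 0.5.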