Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers

About

Recent advancements in 3D Large Language Models (LLMs) have demonstrated promising capabilities for 3D scene understanding. However, previous methods exhibit deficiencies in general referencing and grounding capabilities for intricate scene comprehension. In this paper, we introduce the use of object identifiers and object-centric representations to interact with scenes at the object level. Specifically, we decompose the input 3D scene into a set of object proposals, each assigned a unique identifier token, which enables efficient object referencing and grounding during user-assistant interactions. Given the scarcity of scene-language data, we model the scene embeddings as a sequence of explicit object-level embeddings, derived from semantic-rich 2D or 3D representations. By employing object identifiers, we transform diverse 3D scene-language tasks into a unified question-answering format, facilitating joint training without the need for additional task-specific heads. With minimal fine-tuning on all downstream tasks, our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.

Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, Zhou Zhao • 2023
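
The object-identifier idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `<OBJxxx>` token format, the `ObjectProposal` class, and all helper names below are hypothetical and chosen only for illustration. It shows how each object proposal could be paired with a unique identifier token, and how a grounding task could be phrased as question answering whose answer is an identifier.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ObjectProposal:
    """One object proposal from the scene (hypothetical structure)."""
    index: int
    embedding: List[float]  # stand-in for a 2D- or 3D-derived object feature


def identifier_token(index: int) -> str:
    # Assumed token format; the real vocabulary may differ.
    return f"<OBJ{index:03d}>"


def build_scene_sequence(
    proposals: List[ObjectProposal],
) -> List[Union[str, List[float]]]:
    """Interleave each identifier token with its object-level embedding."""
    sequence: List[Union[str, List[float]]] = []
    for proposal in proposals:
        sequence.append(identifier_token(proposal.index))
        sequence.append(proposal.embedding)
    return sequence


def grounding_as_qa(description: str, target: ObjectProposal) -> dict:
    """Phrase a grounding query as a QA pair answered by an identifier."""
    return {
        "question": f"Which object matches this description: {description}",
        "answer": identifier_token(target.index),
    }


def parse_identifiers(text: str) -> List[int]:
    """Recover object indices from identifier tokens in generated text."""
    return [int(match) for match in re.findall(r"<OBJ(\d{3})>", text)]


proposals = [ObjectProposal(0, [0.1, 0.2]), ObjectProposal(1, [0.3, 0.4])]
scene_sequence = build_scene_sequence(proposals)
qa_pair = grounding_as_qa("the chair next to the window", proposals[1])
referenced = parse_identifiers("The chair is <OBJ001>, near <OBJ000>.")
```

Because the answers are plain identifier tokens, grounding, captioning, and question answering can share one text interface in this sketch, which is the property the abstract attributes to the unified question-answering format.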

Related benchmarks

Task	Dataset	Metric	Result
3D Question Answering	ScanQA (val)	CIDEr	87.7	391
3D Visual Grounding	ScanRefer (val)	Overall Accuracy @ IoU 0.50	50.23	262
3D Question Answering	SQA3D (test)	EM@1	54.7	197
3D Visual Grounding	ScanRefer	Acc@0.25	57.5	172
Spatio-Temporal Reasoning	STCR	Accuracy	51	168
3D Dense Captioning	Scan2Cap	CIDEr @0.5	77.1	127
3D Visual Grounding	Nr3D	Overall Success Rate	63.6	109
3D Question Answering	SQA3D	EM	54.7	69
3D Visual Grounding	ScanRefer Overall	Acc @ 0.5	50.2	55
3D Dense Captioning	Scan2Cap (val)	CIDEr (@0.5)	77.19	53

Showing 10 of 60 rows
