Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions

About

Understanding 3D scenes goes beyond simply recognizing objects; it requires reasoning about the spatial and semantic relationships between them. Current 3D scene-language models often struggle with this relational understanding, particularly when visual embeddings alone do not adequately convey the roles and interactions of objects. In this paper, we introduce Descrip3D, a novel and powerful framework that explicitly encodes the relationships between objects using natural language. Unlike previous methods that rely only on 2D and 3D embeddings, Descrip3D enhances each object with a textual description that captures both its intrinsic attributes and contextual relationships. These relational cues are incorporated into the model through a dual-level integration: embedding fusion and prompt-level injection. This allows for unified reasoning across various tasks such as grounding, captioning, and question answering, all without the need for task-specific heads or additional supervision. When evaluated on five benchmark datasets, including ScanRefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap, Descrip3D consistently outperforms strong baseline models, demonstrating the effectiveness of language-guided relational representation for understanding complex indoor scenes. Our code and data are publicly available at https://github.com/jintangxue/Descrip3D.

Jintang Xue, Ganning Zhao, Jie-En Yao, Hong-En Chen, Yue Hu, Meida Chen, Suya You, C.-C. Jay Kuo • 2025
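
The dual-level integration described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the `SceneObject` type, the element-wise sum used for fusion, the `<OBJxxx>` identifier format, and the prompt template are all hypothetical, and the plain float lists stand in for real 2D/3D and text embeddings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneObject:
    """Hypothetical container for one detected object in a scene."""
    object_id: int
    visual_embedding: List[float]       # stand-in for 2D/3D features
    description: str                    # object-level text description
    description_embedding: List[float]  # stand-in for the encoded description


def fuse_embeddings(obj: SceneObject) -> List[float]:
    """Level 1 (embedding fusion): combine visual and description features.

    Assumption: an element-wise sum; the actual fusion operator may differ.
    """
    if len(obj.visual_embedding) != len(obj.description_embedding):
        raise ValueError("embeddings must share the same dimension")
    return [v + d for v, d in zip(obj.visual_embedding, obj.description_embedding)]


def build_prompt(objects: List[SceneObject], question: str) -> str:
    """Level 2 (prompt-level injection): place the descriptions in the LLM prompt.

    Assumption: one line per object, keyed by a hypothetical identifier token.
    """
    lines = [f"<OBJ{obj.object_id:03d}>: {obj.description}" for obj in objects]
    return "Scene objects:\n" + "\n".join(lines) + f"\nQuestion: {question}"


chair = SceneObject(
    object_id=1,
    visual_embedding=[1.0, 2.0],
    description="a wooden chair next to the desk",
    description_embedding=[0.5, 0.5],
)
fused = fuse_embeddings(chair)
prompt = build_prompt([chair], "What is beside the desk?")
```

In this sketch the same description reaches the model twice: once folded into the object's feature vector and once as readable text in the prompt, which is the general idea behind combining the two levels.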

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| 3D Question Answering | ScanQA (val) | CIDEr 93.7 | 290 |
| 3D Dense Captioning | Scan2Cap (val) | -- | 43 |
| Visual Grounding | ScanRefer v1 (val) | Acc@0.5 (Unique) 83.2 | 35 |
| 3D Visual Grounding | Multi3DRefer (val) | F1@0.50 55.1 | 29 |
| 3D Question Answering | SQA3D (val) | EM 55.7 | 12 |
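
For readers unfamiliar with the grounding metric, Acc@0.5 is the fraction of predicted boxes whose intersection-over-union (IoU) with the ground-truth box is at least 0.5. The following is a minimal sketch of that computation, assuming axis-aligned 3D boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`; the function names are illustrative, and the benchmark's official evaluation scripts may differ in details such as box format and subset handling.

```python
from typing import Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union if union > 0 else 0.0


def accuracy_at_iou(
    predictions: Sequence[Box],
    ground_truths: Sequence[Box],
    threshold: float = 0.5,
) -> float:
    """Fraction of predictions whose IoU with the paired ground truth meets the threshold."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground truths must be paired")
    if not predictions:
        return 0.0
    hits = sum(
        1 for p, g in zip(predictions, ground_truths) if box_iou_3d(p, g) >= threshold
    )
    return hits / len(predictions)


reference = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
shifted = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)  # overlaps half of the reference box
score = accuracy_at_iou([reference, shifted], [reference, reference])
```

Here the exact match counts as a hit, while the shifted box has an IoU of one third and falls below the 0.5 threshold.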