Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions
About
Understanding 3D scenes goes beyond simply recognizing objects; it requires reasoning about the spatial and semantic relationships between them. Current 3D scene-language models often struggle with this relational understanding, particularly when visual embeddings alone do not adequately convey the roles and interactions of objects. In this paper, we introduce Descrip3D, a novel and powerful framework that explicitly encodes the relationships between objects using natural language. Unlike previous methods that rely only on 2D and 3D embeddings, Descrip3D enhances each object with a textual description that captures both its intrinsic attributes and contextual relationships. These relational cues are incorporated into the model through a dual-level integration: embedding fusion and prompt-level injection. This allows for unified reasoning across various tasks such as grounding, captioning, and question answering, all without the need for task-specific heads or additional supervision. When evaluated on five benchmark datasets (ScanRefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap), Descrip3D consistently outperforms strong baseline models, demonstrating the effectiveness of language-guided relational representation for understanding complex indoor scenes. Our code and data are publicly available at https://github.com/jintangxue/Descrip3D.
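The dual-level integration mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: the `SceneObject` class, the `<OBJ###>` tag format, the element-wise sum used for fusion, and the toy two-dimensional embeddings are all assumptions made purely for illustration. The sketch only shows the general idea that each object's description can enter the model twice, once mixed into its embedding and once as text in the prompt.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneObject:
    obj_id: int
    visual_emb: List[float]  # stand-in for the 2D/3D object embedding
    description: str         # object-level text: attributes and relations
    text_emb: List[float]    # stand-in for an embedding of `description`


def fuse_embeddings(obj: SceneObject) -> List[float]:
    """Embedding-level fusion, assumed here to be an element-wise sum."""
    if len(obj.visual_emb) != len(obj.text_emb):
        raise ValueError("embeddings must share one dimension")
    return [v + t for v, t in zip(obj.visual_emb, obj.text_emb)]


def build_prompt(objects: List[SceneObject], question: str) -> str:
    """Prompt-level injection: list each description next to its object tag."""
    lines = [f"<OBJ{o.obj_id:03d}>: {o.description}" for o in objects]
    return "Scene objects:\n" + "\n".join(lines) + f"\nQuestion: {question}"


# Toy scene with hypothetical values.
objects = [
    SceneObject(0, [0.1, 0.2], "a wooden chair next to the desk", [0.3, 0.1]),
    SceneObject(1, [0.5, 0.4], "a black monitor on top of the desk", [0.0, 0.2]),
]
fused = [fuse_embeddings(o) for o in objects]
prompt = build_prompt(objects, "What is on the desk?")
print(prompt)
```

In a real system the fused vectors would presumably be passed to the language model as object tokens alongside the prompt text; the actual fusion operator and prompt template are defined in the paper and the linked repository.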
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D Question Answering | ScanQA (val) | CIDEr 93.7 | 290 |
| 3D Dense Captioning | Scan2Cap (val) | -- | 43 |
| Visual Grounding | ScanRefer v1 (val) | Acc@0.5 (Unique) 83.2 | 35 |
| 3D Visual Grounding | Multi3DRefer (val) | F1@0.50 55.1 | 29 |
| 3D Question Answering | SQA3D (val) | EM 55.7 | 12 |