Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning
About
This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently project these features into the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.
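The projection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the layer's exact form, so this assumes a single linear map; the dimensions, function names, and random features below are hypothetical and chosen only for illustration, not taken from the paper.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Build a randomly initialised linear layer (assumed form, not the paper's)."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def project(features, weights, bias):
    """Map each 3D visual feature vector to the text-embedding dimension."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weights, bias)]
        for feat in features
    ]

# Hypothetical sizes: 5 visual tokens of width 8, projected to a text width of 16.
VISUAL_DIM, TEXT_DIM, NUM_TOKENS = 8, 16, 5
rng = random.Random(1)
scene_features = [[rng.random() for _ in range(VISUAL_DIM)] for _ in range(NUM_TOKENS)]

weights, bias = make_projection(VISUAL_DIM, TEXT_DIM)
visual_tokens = project(scene_features, weights, bias)
print(len(visual_tokens), len(visual_tokens[0]))
```

After projection, each visual token has the same width as a text embedding, so in a setup like this the visual tokens could be placed alongside text tokens in the language model's input sequence.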
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Question Answering | ScanQA (val) | CIDEr | 80 | 133 |
| 3D Question Answering | SQA3D (test) | EM@1 | 54.2 | 55 |
| 3D Situated Question Answering | SQA3D (test) | Average Accuracy | 54.2 | 40 |
| 3D Question Answering | ScanQA v1.0 (test) | ROUGE | 40 | 26 |
| Instruction Following | ALFRED (test-unseen) | GC | 33.75 | 23 |
| 3D Dense Captioning | Scan2Cap | -- | -- | 23 |
| 3D Question Answering | ScanQA | C Score | 80 | 16 |
| Embodied Task Completion | ALFRED seen (test) | Success Rate (SR) | 26.52 | 14 |
| Situated 3D Question Answering | SQA3D (test) | EM@1 | 54.2 | 12 |
| 3D Question Answering | SQA3D | EM | 53.6 | 11 |