SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
About
Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily because the mechanism of human-like scene-object grounded reasoning remains under-explored. This paper bridges the gap with a novel framework. We first introduce a grounded Chain-of-Thought reasoning method in 3D scenes (SCENECOT), which decouples a complex reasoning task into simpler, manageable problems and builds the corresponding visual clues with multimodal expert modules. To enable this method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.
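The staged decoupling described above can be sketched as a minimal pipeline. This is an illustrative assumption about the flow (task identification, then grounding via an expert module, then reasoning over the grounded clues), not the authors' actual implementation; the stage names, the stub experts, and the toy scene representation are all hypothetical.

```python
# Hypothetical sketch of a SceneCOT-style grounded CoT pipeline.
# Stage names and expert modules are illustrative assumptions,
# not the paper's actual architecture.

def identify_task(question):
    # Stage 1 (assumed): classify the question type from keywords.
    if "how many" in question.lower():
        return "counting"
    return "attribute"

def ground_objects(scene, question):
    # Stage 2 (assumed): a grounding "expert" returns candidate objects
    # whose label appears in the question (stand-in for a real detector).
    return [obj for obj in scene if obj["label"] in question.lower()]

def reason(task, candidates):
    # Stage 3 (assumed): answer from the grounded visual clues.
    if task == "counting":
        return str(len(candidates))
    return candidates[0]["color"] if candidates else "unknown"

def scenecot_answer(scene, question):
    # Chain the stages: task identification -> grounding -> reasoning.
    task = identify_task(question)
    candidates = ground_objects(scene, question)
    return reason(task, candidates)

scene = [
    {"label": "chair", "color": "red"},
    {"label": "chair", "color": "blue"},
    {"label": "table", "color": "brown"},
]
print(scenecot_answer(scene, "How many chairs are in the room?"))  # -> 2
```

The point of the decomposition is that each stage produces an inspectable intermediate (the task type, the grounded candidates) rather than mapping question to answer in one opaque step.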
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Question Answering | MSQA | Count Accuracy | 47.9 | 25 |
| 3D Question Answering | Beacon3D | Case Score | 58.9 | 23 |
| QA-driven Grounding | MSQA | F1@50 | 52.1 | 3 |
| QA-driven Grounding | SQA3D | F1@50 | 51.6 | 3 |
| QA-driven Grounding | ScanQA | F1@50 | 40.8 | 3 |
| Question Answering | SQA3D | EM-R | 39.7 | 3 |
| Question Answering | ScanQA | EM-R | 21.0 | 3 |
| Visual Grounding | Nr3D | Top-1 Accuracy | 57.7 | 3 |
| Visual Grounding | Beacon3D | Top-1 Accuracy | 67.8 | 3 |
| Visual Grounding | SQA3D-G | F1@50 | 51.6 | 2 |
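Several rows above report F1@50, which is conventionally an F1 score where a predicted box counts as a true positive when its IoU with a ground-truth box is at least 0.5. A minimal sketch of that idea, assuming axis-aligned 2D boxes and greedy one-to-one matching (the benchmarks' exact matching protocol and 3D box format may differ):

```python
def iou(a, b):
    # Intersection-over-union for axis-aligned boxes (x1, y1, x2, y2);
    # 3D boxes work analogously with a volume term.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def f1_at_50(preds, gts):
    # Greedy one-to-one matching at IoU >= 0.5, then F1 = 2PR / (P + R).
    matched, tp = set(), 0
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched and iou(p, g) >= 0.5:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

preds = [(0, 0, 2, 2), (5, 5, 6, 6)]   # one correct, one spurious box
gts = [(0, 0, 2, 2), (10, 10, 12, 12)]  # one matched, one missed box
print(f1_at_50(preds, gts))  # -> 0.5 (precision 0.5, recall 0.5)
```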