
SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

About

Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mechanism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of-Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.

Xiongkun Linghu, Jiangyong Huang, Ziyu Zhu, Baoxiong Jia, Siyuan Huang • 2025

Related benchmarks

Task                    Dataset    Metric          Result  Rank
3D Question Answering   MSQA       Count Accuracy  47.9    25
3D Question Answering   Beacon3D   Case Score      58.9    23
QA-driven Grounding     MSQA       F1@50           52.1    3
QA-driven Grounding     SQA3D      F1@50           51.6    3
QA-driven Grounding     ScanQA     F1@50           40.8    3
Question Answering      SQA3D      EM-R            39.7    3
Question Answering      ScanQA     EM-R            21      3
Visual Grounding        Nr3D       Top-1 Accuracy  57.7    3
Visual Grounding        Beacon3D   Top-1 Accuracy  67.8    3
Visual Grounding        SQA3D-G    F1@50           51.6    2
Showing 10 of 11 rows
