Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model

About

Recent advancements in multimodal large language models (LLMs) have demonstrated significant potential across various domains, particularly in concept reasoning. However, their applications in understanding 3D environments remain limited, primarily offering textual or numerical outputs without generating dense, informative segmentation masks. This paper introduces Reason3D, a novel LLM designed for comprehensive 3D understanding. Reason3D processes point cloud data and text prompts to produce textual responses and segmentation masks, enabling advanced tasks such as 3D reasoning segmentation, hierarchical searching, express referring, and question answering with detailed mask outputs. We propose a hierarchical mask decoder that employs a coarse-to-fine approach to segment objects within expansive scenes. It begins with a coarse location estimation, followed by object mask estimation, using two unique tokens predicted by LLMs based on the textual query. Experimental results on large-scale ScanNet and Matterport3D datasets validate the effectiveness of our Reason3D across various tasks.
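The coarse-to-fine idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `coarse_to_fine_mask`, the dot-product scoring, the precomputed `region_ids`, and the threshold are all assumptions made for illustration. The only parts taken from the abstract are the two stages (coarse location first, object mask second) and the use of two separate tokens to drive them.

```python
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, used here as a stand-in similarity score."""
    return sum(x * y for x, y in zip(a, b))


def coarse_to_fine_mask(
    point_feats: List[Sequence[float]],  # one feature vector per point (hypothetical)
    region_ids: List[int],               # coarse region of each point (hypothetical)
    loc_token: Sequence[float],          # embedding of the "location" token (assumed)
    seg_token: Sequence[float],          # embedding of the "segmentation" token (assumed)
    threshold: float = 0.0,              # assumed cut-off for the fine mask
) -> List[bool]:
    # Stage 1 (coarse): score each region by the mean similarity between
    # its points and the location token, then keep the best region.
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for feat, rid in zip(point_feats, region_ids):
        sums[rid] = sums.get(rid, 0.0) + dot(feat, loc_token)
        counts[rid] = counts.get(rid, 0) + 1
    best_region = max(sums, key=lambda r: sums[r] / counts[r])

    # Stage 2 (fine): inside the chosen region only, mark points whose
    # similarity to the segmentation token exceeds the threshold.
    return [
        rid == best_region and dot(feat, seg_token) > threshold
        for feat, rid in zip(point_feats, region_ids)
    ]


feats = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.2]]
mask = coarse_to_fine_mask(feats, [0, 0, 1, 1], loc_token=[1.0, 0.0], seg_token=[0.0, 1.0])
print(mask)
```

Restricting the fine stage to one coarse region keeps distant points from entering the mask even when they resemble the query, which is the motivation for a coarse-to-fine design in large scenes.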

Kuan-Chih Huang, Xiangtai Li, Lu Qi, Shuicheng Yan, Ming-Hsuan Yang • 2024

Related benchmarks

Task	Dataset	Result
3D Question Answering	ScanQA (val)	CIDEr 73.5	391
Referring 3D Instance Segmentation	ScanRefer (val)	mIoU 74.6	43
3D Visual Grounding	ScanRefer (test)	Accuracy@0.25 17.64	29
3D Referring Expression Segmentation	ScanRefer	Accuracy@0.25 57.9	25
3D Referring Expression Segmentation	ScanRefer Overall 1 (val)	Acc@0.25 57.9	9
3D reasoning segmentation	ScanNet V2 (val)	Acc@0.25 43.21	8
3D reasoning segmentation	Matterport3D (val)	Acc@0.25 31.22	8
3D Referring Expression Segmentation	ScanRefer Multiple	Acc@25 0.505	7
3D Segmentation	Reason3D	Acc@0.25 43.21	7
Viewpoint-aware referring segmentation	Proposed dataset (full split)	mIoU 3.21	5
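
For readers unfamiliar with the metrics in the table, the sketch below shows how Acc@0.25 and mIoU are commonly computed for mask predictions: per-sample IoU between the predicted and ground-truth masks, then either the fraction of samples at or above the IoU threshold, or the mean IoU. The function names are illustrative, and the use of `>=` rather than `>` at the threshold is an assumption; individual benchmarks may differ in such details.

```python
from typing import List, Sequence


def mask_iou(pred: Sequence[bool], gt: Sequence[bool]) -> float:
    """Intersection over union of two per-point boolean masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 0.0  # two empty masks scored as 0 (assumption)


def acc_at_iou(ious: List[float], threshold: float = 0.25) -> float:
    """Fraction of samples whose IoU reaches the threshold (Acc@threshold)."""
    return sum(1 for v in ious if v >= threshold) / len(ious) if ious else 0.0


def mean_iou(ious: List[float]) -> float:
    """Average IoU over all samples (mIoU)."""
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: one computed IoU plus two made-up per-sample IoUs.
iou_0 = mask_iou([True, True, False, False], [True, False, True, False])
ious = [iou_0, 0.1, 0.5]
print(acc_at_iou(ious, 0.25), mean_iou(ious))
```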

