
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

About

This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language. ShapeLLM is built upon an improved 3D encoder, extending ReCon to ReCon++, which benefits from multi-view image distillation for enhanced geometry understanding. Using ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and tested on a newly human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding. Project page: https://qizekun.github.io/shapellm/
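The pipeline the abstract describes — a point-cloud encoder whose output tokens are projected into the LLM's embedding space and joined with the text tokens — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: all function names, grouping strategy, weights, and dimensions here are invented assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_point_cloud(points, num_tokens=8, d_enc=64):
    """Stand-in for a ReCon++-style encoder: group points into
    num_tokens patches and embed each (random weights, naive grouping)."""
    groups = np.array_split(points, num_tokens)           # hypothetical patching
    feats = np.stack([g.mean(axis=0) for g in groups])    # (num_tokens, 3)
    W = rng.standard_normal((3, d_enc))                   # illustrative weights
    return feats @ W                                      # (num_tokens, d_enc)

def project_to_llm(shape_tokens, d_llm=128):
    """Linear projector mapping encoder features to the LLM embedding size."""
    W = rng.standard_normal((shape_tokens.shape[1], d_llm))
    return shape_tokens @ W                               # (num_tokens, d_llm)

# Toy inputs: a point cloud and already-embedded text tokens
points = rng.standard_normal((1024, 3))                   # N x (x, y, z)
text_emb = rng.standard_normal((5, 128))                  # 5 text tokens

shape_emb = project_to_llm(encode_point_cloud(points))
llm_input = np.concatenate([shape_emb, text_emb], axis=0)
print(llm_input.shape)  # (13, 128): 8 shape tokens + 5 text tokens
```

The key idea is only the interface: geometry is tokenized by the encoder, linearly projected to the language model's hidden size, and concatenated with text embeddings so a single sequence model can attend over both.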

Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, Kaisheng Ma • 2024

Related benchmarks

Task                           Dataset                   Metric             Result   Rank
3D Object Classification       ModelNet40 (test)         Accuracy           93.6     302
Object Classification          ScanObjectNN OBJ_BG       Accuracy           98.62    215
Object Classification          ScanObjectNN PB_T50_RS    Accuracy           93.34    195
Object Classification          ScanObjectNN OBJ_ONLY     Overall Accuracy   96.21    166
3D Object Classification       Objaverse-LVIS (test)     Top-1 Accuracy     49.6     95
Shape Classification           ScanObjectNN PB_T50_RS    OA                 95.25    72
3D Point Cloud Classification  ModelNet40                Accuracy           94.6     69
3D Object Classification       ModelNet40                --                 --       62
3D Object Classification       ModelNet40 few-shot       Accuracy           99.5     60
3D Captioning                  Objaverse (test)          S-BERT Score       48.52    28

(Showing 10 of 33 rows)
