ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
About
This paper presents ShapeLLM, the first 3D multimodal large language model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language. ShapeLLM is built on an improved 3D encoder, ReCon++, which extends ReCon with multi-view image distillation for enhanced geometry understanding. Using ReCon++ as the 3D point-cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and evaluated on our newly human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding. Project page: https://qizekun.github.io/shapellm/
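The pipeline described above — a 3D point-cloud encoder whose output tokens are projected into the LLM's embedding space and consumed alongside text tokens — can be sketched roughly as follows. This is a minimal illustrative sketch: the function names, token counts, dimensions, random weights, and linear projector are all assumptions for exposition, not the paper's actual ReCon++ or LLM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_point_cloud(points, num_tokens=8, token_dim=16):
    # Stand-in for a 3D encoder like ReCon++: maps an (N, 3) point cloud
    # to a fixed-length sequence of geometry tokens. Real learned weights
    # are assumed; random projections are used here for illustration.
    w = rng.standard_normal((3, token_dim))
    feats = points @ w                                  # (N, token_dim)
    # Pool the per-point features into num_tokens groups.
    groups = np.array_split(feats, num_tokens)
    return np.stack([g.mean(axis=0) for g in groups])   # (num_tokens, token_dim)

def project_to_llm(tokens, llm_dim=32):
    # Hypothetical linear projector from encoder space to LLM embedding space.
    w = rng.standard_normal((tokens.shape[1], llm_dim))
    return tokens @ w                                   # (num_tokens, llm_dim)

# Toy point cloud and pretend-tokenized text prompt.
points = rng.standard_normal((1024, 3))
pc_tokens = project_to_llm(encode_point_cloud(points))  # (8, 32)
text_tokens = rng.standard_normal((5, 32))

# The LLM consumes the point-cloud tokens prepended to the text tokens.
llm_input = np.concatenate([pc_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (13, 32)
```

The design choice mirrored here is the common multimodal-LLM recipe: freeze or pretrain a modality encoder, then train a lightweight projector so its tokens live in the language model's input space.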
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Object Classification | ModelNet40 (test) | Accuracy | 93.6 | 302 |
| Object Classification | ScanObjectNN OBJ_BG | Accuracy | 98.62 | 215 |
| Object Classification | ScanObjectNN PB_T50_RS | Accuracy | 93.34 | 195 |
| Object Classification | ScanObjectNN OBJ_ONLY | Overall Accuracy | 96.21 | 166 |
| 3D Object Classification | Objaverse-LVIS (test) | Top-1 Accuracy | 49.6 | 95 |
| Shape Classification | ScanObjectNN PB_T50_RS | Overall Accuracy | 95.25 | 72 |
| 3D Point Cloud Classification | ModelNet40 | Accuracy | 94.6 | 69 |
| 3D Object Classification | ModelNet40 | -- | -- | 62 |
| 3D Object Classification | ModelNet40 few-shot | Accuracy | 99.5 | 60 |
| 3D Captioning | Objaverse (test) | S-BERT Score | 48.52 | 28 |