Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
About
We introduce Point-Bind, a 3D multi-modality model that aligns point clouds with 2D images, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D and the other modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we further present Point-LLM, the first 3D large language model (LLM) that follows 3D multi-modal instructions. Via parameter-efficient fine-tuning, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA; it requires no 3D instruction data, yet exhibits superior 3D and multi-modal question-answering capacity. We hope our work sheds light on extending 3D point clouds to multi-modality applications. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.
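Once all modalities live in one joint embedding space, "3D embedding arithmetic" amounts to adding L2-normalized embeddings and retrieving the nearest neighbor by cosine similarity. The sketch below illustrates this idea only; the `compose`/`retrieve` helpers and the random vectors standing in for Point-Bind/ImageBind encoder outputs are hypothetical, not the project's actual API.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize an embedding vector."""
    return v / np.linalg.norm(v)

def compose(*embeddings: np.ndarray) -> np.ndarray:
    """Embedding arithmetic: sum normalized embeddings, then re-normalize."""
    return normalize(sum(normalize(e) for e in embeddings))

def retrieve(query: np.ndarray, gallery: np.ndarray) -> int:
    """Index of the gallery embedding most similar to the query (cosine)."""
    sims = gallery @ normalize(query)  # dot product of unit vectors
    return int(np.argmax(sims))

# Toy demo: random unit vectors stand in for real encoder outputs.
rng = np.random.default_rng(0)
e_point = normalize(rng.normal(size=16))  # e.g., a 3D point-cloud embedding
e_audio = normalize(rng.normal(size=16))  # e.g., an audio embedding
query = compose(e_point, e_audio)         # "point cloud + sound" query

# Gallery of 5 distractors plus the true composite target at index 5.
gallery = np.stack([normalize(rng.normal(size=16)) for _ in range(5)] + [query])
best = retrieve(query, gallery)
```

In the real model, `e_point` would come from Point-Bind's 3D encoder and `e_audio` from ImageBind's audio encoder; because both are trained into the same space, the same arithmetic applies unchanged.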
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | UCF101 | -- | -- | 365 |
| 3D Object Classification | ModelNet40 (test) | Accuracy | 78 | 302 |
| 3D Question Answering | ScanQA (val) | -- | -- | 133 |
| Audio Classification | ESC50 | Top-1 Accuracy | 83.4 | 64 |
| Image Classification | Places365 | Top-1 Accuracy | 49.45 | 62 |
| 3D Shape Retrieval | ModelNet40 (test) | mAP | 63.23 | 38 |
| Recognition | ImageNet-1K | Top-1 Accuracy | 80.19 | 27 |
| 3D Object Recognition | ShapeNet | Accuracy | 99.09 | 20 |
| 3D Object Classification | Objaverse | Average Accuracy | 25.53 | 19 |
| 3D Object Classification | ModelNet40 (test) | Average Accuracy | 45.81 | 17 |