
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following

About

We introduce Point-Bind, a 3D multi-modality model that aligns point clouds with 2D images, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D and other modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we further present Point-LLM, the first 3D large language model (LLM) that follows 3D multi-modal instructions. Using parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA; it requires no 3D instruction data, yet exhibits superior 3D and multi-modal question-answering capacity. We hope our work may shed light for the community on extending 3D point clouds to multi-modality applications. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.
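The "3D embedding arithmetic" idea above can be sketched in a few lines: once every modality is encoded into one shared, unit-normalized space (as in CLIP/ImageBind-style models), embeddings from different modalities can be added together and the sum used as a retrieval query. The snippet below is a minimal illustration with random vectors standing in for real encoder outputs; it is not Point-Bind's actual model or API.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    # Project embeddings onto the unit sphere, as in CLIP/ImageBind-style spaces.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-ins for modality encoders mapping inputs into one shared D-dim space.
# (Random vectors here; in Point-Bind these would come from the 3D, audio, etc. encoders.)
D = 8
point_emb = normalize(rng.normal(size=D))   # hypothetical 3D point-cloud embedding
audio_emb = normalize(rng.normal(size=D))   # hypothetical audio embedding

# Embedding arithmetic: sum embeddings across modalities and re-normalize;
# the result queries the joint space for the composite concept.
query = normalize(point_emb + audio_emb)

# Cross-modal retrieval: rank a gallery of candidates by cosine similarity.
gallery = normalize(rng.normal(size=(5, D)))
scores = gallery @ query            # dot product of unit vectors = cosine similarity
best = int(np.argmax(scores))       # index of the best-matching gallery item
```

Because all vectors live on the unit sphere, the dot product directly gives cosine similarity, which is what makes simple addition of embeddings from different modalities meaningful as a query.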

Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, Pheng-Ann Heng • 2023

Related benchmarks

| Task                       | Dataset           | Metric           | Result | Rank |
|----------------------------|-------------------|------------------|--------|------|
| Action Recognition         | UCF101            | -                | -      | 365  |
| 3D Object Classification   | ModelNet40 (test) | Accuracy         | 78     | 302  |
| 3D Question Answering      | ScanQA (val)      | -                | -      | 133  |
| Audio Classification       | ESC50             | Top-1 Acc        | 83.4   | 64   |
| Image Classification       | Places365         | Top-1 Accuracy   | 49.45  | 62   |
| 3D Shape Retrieval         | ModelNet40 (test) | mAP              | 63.23  | 38   |
| Recognition                | ImageNet-1K       | Top-1 Accuracy   | 80.19  | 27   |
| 3D Object Recognition      | ShapeNet          | Accuracy         | 99.09  | 20   |
| 3D Object Classification   | Objaverse         | Average Accuracy | 25.53  | 19   |
| 3D Object Classification   | ModelNet40 (test) | Average Accuracy | 45.81  | 17   |

Showing 10 of 39 rows
