LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
About
Recent advances in Large Multimodal Models (LMMs) have enabled a variety of applications in human-machine interaction. However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains challenging, particularly given the need to understand permutation-invariant point cloud representations of the 3D scene. Existing works rely on multi-view images, projecting 2D features into 3D space as scene representations; this, however, incurs substantial computational overhead and degrades performance. In this paper, we present LL3DA, a Large Language 3D Assistant that takes point clouds as direct input and responds to both textual instructions and visual prompts. This helps LMMs better comprehend human interactions and further helps resolve ambiguities in cluttered 3D scenes. Experiments show that LL3DA achieves remarkable results, surpassing various 3D vision-language models on both 3D Dense Captioning and 3D Question Answering.
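The interaction model described above can be sketched as a minimal interface: the assistant consumes a raw point cloud together with a textual instruction and an optional visual prompt (e.g., a user click or a 3D box) that disambiguates the target object. All names below (`Scene3D`, `VisualPrompt`, `assistant_respond`) are illustrative assumptions, not the actual LL3DA API.

```python
# Hypothetical sketch of the LL3DA-style interaction interface.
# None of these names come from the real codebase; they only
# illustrate the inputs described in the abstract.
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class VisualPrompt:
    """A user click or a 3D box that removes ambiguity in a cluttered scene."""
    click_xyz: Optional[Point] = None
    box_xyzwhd: Optional[Tuple[float, float, float, float, float, float]] = None


@dataclass
class Scene3D:
    """A permutation-invariant set of points; colors are optional."""
    points: List[Point]
    colors: Optional[List[Tuple[float, float, float]]] = None


def assistant_respond(scene: Scene3D, instruction: str,
                      prompt: Optional[VisualPrompt] = None) -> str:
    """Toy stand-in for the model: a real system would encode the point
    cloud, fuse the visual prompt, and decode a reply with a language
    model. Here we only show how the inputs are routed."""
    if prompt is not None and prompt.click_xyz is not None:
        target = "the clicked region"
    elif prompt is not None and prompt.box_xyzwhd is not None:
        target = "the boxed region"
    else:
        target = "the whole scene"
    return f"[response about {target}] {instruction}"
```

For example, the same instruction yields a scene-level answer without a prompt and a localized answer when the user clicks a point.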
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D Question Answering | ScanQA (val) | METEOR 15.91 | 217 |
| Spatio-Temporal Reasoning | STCR | Accuracy 39.6 | 168 |
| 3D Dense Captioning | Scan2Cap | CIDEr@0.5 65.2 | 96 |
| 3D Dense Captioning | Scan2Cap (val) | B-4 0.368 | 43 |
| 3D Question Answering | ScanQA | -- | 38 |
| 3D Question Answering | ScanQA v1.0 (test) | ROUGE 35.9 | 26 |
| 3D Dense Captioning | ScanRefer | CIDEr@0.5IoU 65.19 | 21 |
| 3D Scene Understanding | ScanQA | METEOR 15.9 | 16 |
| Scene Spatial Awareness QA | 3D-GRAND | Binary Accuracy 53.45 | 14 |
| 3D Dense Captioning | Nr3D | C Score (0.5 IoU) 51.18 | 13 |
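Several captioning entries above use the "metric@0.5" convention common in 3D dense captioning evaluation: a predicted caption contributes its score only when its predicted box overlaps a ground-truth box with IoU at or above the threshold, and counts as zero otherwise. The sketch below assumes axis-aligned boxes and illustrative helper names (`iou_aabb`, `metric_at_iou`); exact averaging details vary between benchmarks.

```python
# Hedged sketch of IoU-thresholded caption scoring ("metric@kIoU").
# Boxes are axis-aligned, encoded as (xmin, ymin, zmin, xmax, ymax, zmax).
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float, float, float]


def iou_aabb(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):  # multiply the overlap extent along each axis
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    vol = lambda c: (c[3] - c[0]) * (c[4] - c[1]) * (c[5] - c[2])
    return inter / (vol(a) + vol(b) - inter)


def metric_at_iou(scores: Sequence[float], pred_boxes: Sequence[Box],
                  gt_boxes: Sequence[Box], thresh: float = 0.5) -> float:
    """Average per-box caption score (e.g., CIDEr), zeroing any prediction
    whose best IoU against the ground-truth boxes falls below `thresh`."""
    kept: List[float] = []
    for score, pb in zip(scores, pred_boxes):
        best = max((iou_aabb(pb, gb) for gb in gt_boxes), default=0.0)
        kept.append(score if best >= thresh else 0.0)
    return sum(kept) / len(kept) if kept else 0.0
```

With one well-localized prediction and one missed box, half the caption score survives the 0.5 IoU filter regardless of how good the missed caption was.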