Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention
About
Multi-object 3D grounding involves locating the 3D boxes in a point cloud that correspond to a given query phrase. It is a challenging and significant task with numerous applications in visual understanding, human-computer interaction, and robotics. To tackle this challenge, we introduce D-LISA, a two-stage approach incorporating three innovations. First, a dynamic vision module enables a variable, learnable number of box proposals. Second, a dynamic camera positioning module extracts features for each proposal. Third, a language-informed spatial attention module reasons over the proposals to produce the final prediction. Experiments show that our method outperforms the state of the art on multi-object 3D grounding by 12.8% (absolute) and is competitive on single-object 3D grounding.
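To make the third component concrete, here is a minimal NumPy sketch of one plausible form of language-informed spatial attention. It is an illustrative assumption, not the paper's implementation: content-based attention logits between box proposals are biased by pairwise center distances, scaled by a language-derived weight (`lang_spatial_weight`, a hypothetical scalar summarizing how strongly the query emphasizes spatial relations).

```python
import numpy as np

def language_informed_spatial_attention(box_centers, box_feats, lang_spatial_weight):
    """Hypothetical sketch of spatial attention over box proposals.

    box_centers: (n, 3) proposal box centers
    box_feats: (n, d) proposal features
    lang_spatial_weight: scalar derived from the language query (assumed)
    Returns (n, d) attended proposal features.
    """
    n, d = box_feats.shape
    # content-based attention logits (scaled dot product)
    logits = box_feats @ box_feats.T / np.sqrt(d)
    # pairwise Euclidean distances between proposal centers
    diff = box_centers[:, None, :] - box_centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # spatial bias: nearby proposals attend to each other more strongly,
    # modulated by how spatially focused the query is
    logits = logits - lang_spatial_weight * dist
    # row-wise softmax over proposals
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn = exp / exp.sum(axis=-1, keepdims=True)
    return attn @ box_feats
```

In the actual model the spatial bias would be learned jointly with the language encoder; this sketch only shows where language can enter the attention computation.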
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ 0.5 IoU | 46.9 | 155 |
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate | 53.1 | 88 |
| 3D Visual Grounding | Nr3D | Overall Success Rate | 53.1 | 74 |
| Visual Grounding | ScanRefer v1 (val) | Acc@0.5 (All) | 46.9 | 30 |
| 3D Visual Grounding | ScanRefer (test) | -- | -- | 21 |
| 3D Visual Grounding | ScanRefer v1 (test) | Unique Acc@0.5IoU | 69 | 15 |
| Multi-object 3D grounding | Multi3DRefer (val) | F1@0.5 (ZT, no D) | 82.4 | 6 |
| Multi-object grounding | Multi3DRefer (val) | F1@0.25 (ZT w/o D) | 82.4 | 3 |