Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention

About

Multi-object 3D grounding involves locating 3D boxes in a point cloud based on a given query phrase. It is a challenging and significant task with numerous applications in visual understanding, human-computer interaction, and robotics. To tackle this challenge, we introduce D-LISA, a two-stage approach incorporating three innovations: first, a dynamic vision module that enables a variable and learnable number of box proposals; second, dynamic camera positioning that extracts features for each proposal; and third, a language-informed spatial attention module that better reasons over the proposals to output the final prediction. Experiments show that our method outperforms the state-of-the-art methods on multi-object 3D grounding by 12.8% (absolute) and is competitive in single-object 3D grounding.

Haomeng Zhang, Chiao-An Yang, Raymond A. Yeh • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.50 | 46.9 | 155 |
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate | 53.1 | 88 |
| 3D Visual Grounding | Nr3D | Overall Success Rate | 53.1 | 74 |
| Visual Grounding | ScanRefer v1 (val) | Acc@0.5 (All) | 46.9 | 30 |
| 3D Visual Grounding | ScanRefer (test) | -- | -- | 21 |
| 3D Visual Grounding | ScanRefer v1 (test) | Unique Acc@0.5IoU | 69 | 15 |
| Multi-object 3D grounding | Multi3DRefer (val) | F1@0.5 (ZT, no D) | 82.4 | 6 |
| Multi-object grounding | Multi3DRefer (val) | F1@0.25 (ZT w/o D) | 82.4 | 3 |
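Several of the metrics above are thresholded on 3D intersection-over-union (IoU), for example Acc@0.5. As a rough illustration only, the sketch below computes IoU for axis-aligned 3D boxes and a simple one-prediction-per-query accuracy at a given threshold. The box layout `(xmin, ymin, zmin, xmax, ymax, zmax)` and the function names are assumptions made for this example; they are not taken from the D-LISA code or from the benchmark toolkits, and the multi-object F1 metric (which must match several predictions to several ground-truth boxes) is not covered here.

```python
from typing import Sequence

# Assumed layout for this sketch: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Sequence[float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    dx = max(0.0, box[3] - box[0])
    dy = max(0.0, box[4] - box[1])
    dz = max(0.0, box[5] - box[2])
    return dx * dy * dz


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    # Overlap length along each axis, clamped at zero when the boxes are disjoint.
    overlap = [max(0.0, min(a[i + 3], b[i + 3]) - max(a[i], b[i])) for i in range(3)]
    inter = overlap[0] * overlap[1] * overlap[2]
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5) -> float:
    """Fraction of queries whose single predicted box reaches the IoU threshold.

    Hypothetical helper: assumes exactly one prediction and one ground-truth
    box per query, paired by position.
    """
    if not gts:
        return 0.0
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

For instance, a unit cube compared against the same cube shifted by 0.5 along one axis overlaps in half its volume, giving an IoU of 1/3, which would count as a miss at the 0.5 threshold.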
