Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention
About
Multi-object 3D grounding involves locating the 3D boxes in a point cloud that correspond to a given query phrase. It is a challenging and significant task with numerous applications in visual understanding, human-computer interaction, and robotics. To tackle this challenge, we introduce D-LISA, a two-stage approach incorporating three innovations. First, a dynamic vision module enables a variable, learnable number of box proposals. Second, a dynamic camera positioning module extracts features for each proposal. Third, a language-informed spatial attention module reasons over the proposals to produce the final prediction. Experiments show that our method outperforms the state of the art on multi-object 3D grounding by 12.8% (absolute) and is competitive on single-object 3D grounding.
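To make the third component concrete, here is a minimal NumPy sketch of one plausible form of language-informed spatial attention. It is an illustrative assumption, not the paper's implementation: content-based attention logits between box proposals are biased by pairwise center distances, scaled by a language-derived weight (`lang_spatial_weight`, a hypothetical scalar summarizing how strongly the query emphasizes spatial relations).

```python
import numpy as np

def language_informed_spatial_attention(box_centers, box_feats, lang_spatial_weight):
    """Hypothetical sketch of spatial attention over box proposals.

    box_centers: (n, 3) proposal box centers
    box_feats: (n, d) proposal features
    lang_spatial_weight: scalar derived from the language query (assumed)
    Returns (n, d) attended proposal features.
    """
    n, d = box_feats.shape
    # content-based attention logits (scaled dot product)
    logits = box_feats @ box_feats.T / np.sqrt(d)
    # pairwise Euclidean distances between proposal centers
    diff = box_centers[:, None, :] - box_centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # spatial bias: nearby proposals attend to each other more strongly,
    # modulated by how spatially focused the query is
    logits = logits - lang_spatial_weight * dist
    # row-wise softmax over proposals
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn = exp / exp.sum(axis=-1, keepdims=True)
    return attn @ box_feats
```

In the actual model the spatial bias would be learned jointly with the language encoder; this sketch only shows where language can enter the attention computation.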
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ 0.5 IoU | 46.9 | 155 |
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate | 53.1 | 88 |
| 3D Visual Grounding | Nr3D | Overall Success Rate | 53.1 | 74 |
| Visual Grounding | ScanRefer v1 (val) | Acc@0.5 (All) | 46.9 | 30 |
| 3D Visual Grounding | ScanRefer (test) | -- | -- | 21 |
| 3D Visual Grounding | ScanRefer v1 (test) | Unique Acc@0.5IoU | 69 | 15 |
| Multi-object 3D grounding | Multi3DRefer (val) | F1@0.5 (ZT, no D) | 82.4 | 6 |
| Multi-object grounding | Multi3DRefer (val) | F1@0.25 (ZT w/o D) | 82.4 | 3 |