# Multi-View Transformer for 3D Visual Grounding

## About
The 3D visual grounding task aims to ground a natural language description to the targeted object in a 3D scene, which is usually represented as a 3D point cloud. Previous works studied visual grounding under specific views, and the vision-language correspondence learned this way can easily fail once the view changes. In this paper, we propose a Multi-View Transformer (MVT) for 3D visual grounding. We project the 3D scene into a multi-view space, in which the position information of the 3D scene under different views is modeled simultaneously and aggregated together. The multi-view space enables the network to learn a more robust multi-modal representation for 3D visual grounding and eliminates the dependence on specific views. Extensive experiments show that our approach significantly outperforms all state-of-the-art methods. Specifically, on the Nr3D and Sr3D datasets, our method outperforms the best competitor by 11.2% and 7.1%, respectively, and even surpasses recent work that uses extra 2D assistance by 5.9% and 6.6%. Our code is available at https://github.com/sega-hsj/MVT-3DVG.
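The multi-view projection described above can be sketched roughly as follows (a minimal NumPy sketch, assuming rotations around the z-axis and mean aggregation over views; the function names are illustrative and not taken from the released code):

```python
import numpy as np

def multi_view_positions(centers: np.ndarray, num_views: int = 4) -> np.ndarray:
    """Project object positions into `num_views` views by rotating the scene
    around the z-axis in equal angular steps.

    centers: (N, 3) array of object center coordinates in the scene.
    Returns: (num_views, N, 3) array of per-view coordinates.
    """
    views = []
    for k in range(num_views):
        theta = 2.0 * np.pi * k / num_views
        c, s = np.cos(theta), np.sin(theta)
        # Rotation about the (vertical) z-axis by angle theta.
        rot_z = np.array([[c,  -s,  0.0],
                          [s,   c,  0.0],
                          [0.0, 0.0, 1.0]])
        views.append(centers @ rot_z.T)
    return np.stack(views, axis=0)

def aggregate_views(per_view_feats: np.ndarray) -> np.ndarray:
    """View-agnostic fusion: average features over the view axis, so the
    result no longer depends on any single view."""
    return per_view_feats.mean(axis=0)
```

In the actual model, each view's coordinates would be fed through a learned position encoding and a transformer before aggregation; the sketch only shows the geometric part that makes the representation view-robust.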
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.50: 66.45 | 155 |
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate: 59.5 | 88 |
| 3D Visual Grounding | Nr3D | Overall Success Rate: 59.5 | 74 |
| 3D Visual Grounding | Sr3D (test) | Overall Accuracy: 64.5 | 73 |
| Visual Grounding | ScanRefer v1 (val) | -- | 30 |
| 3D Visual Grounding | ScanRefer (test) | -- | 21 |
| 3D Visual Grounding | ScanRefer Overall | Acc @ 0.25: 40.8 | 17 |
| 3D referring expression comprehension | SR3D ReferIt3D (test) | Overall Accuracy: 64.5 | 11 |
| 3D Object Grounding | ScanRefer detected proposals v1 (val) | Unique Acc@0.25: 77.67 | 10 |
| 3D referring expression comprehension | NR3D constrained subset ReferIt3D (test) | Overall Accuracy: 43 | 5 |