UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding

About

Performing 3D dense captioning and visual grounding requires a common, shared understanding of the underlying multimodal relationships. However, despite some previous attempts to connect these two related tasks with highly task-specific neural modules, how to explicitly model their shared nature and learn them simultaneously remains understudied. In this work, we propose UniT3D, a simple yet effective, fully unified transformer-based architecture for jointly solving 3D visual grounding and dense captioning. UniT3D learns a strong multimodal representation across the two tasks through a supervised joint pre-training scheme with bidirectional and seq-to-seq objectives. With its generic architecture design, UniT3D allows the pre-training scope to be expanded to more varied training sources, such as data synthesized from 2D prior knowledge, to benefit 3D vision-language tasks. Extensive experiments and analysis demonstrate that UniT3D obtains significant gains on 3D dense captioning and visual grounding.
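
The abstract does not say how the bidirectional and seq-to-seq objectives differ inside a single transformer. One common way to realize both with shared weights is to change only the self-attention mask. The sketch below assumes that approach and is not taken from the UniT3D paper; the function names and the token layout (object tokens first, then text tokens) are hypothetical.

```python
# Illustrative only: attention masks for two pre-training objectives that
# share one transformer. mask[q][k] == True means query token q may attend
# to key token k. Assumed layout: [object tokens ..., text tokens ...].

def bidirectional_mask(num_obj, num_txt):
    """Every token attends to every other token (grounding-style objective)."""
    n = num_obj + num_txt
    return [[True] * n for _ in range(n)]


def seq2seq_mask(num_obj, num_txt):
    """Object tokens see only objects; each text token sees all objects
    plus the text tokens up to and including itself (captioning-style)."""
    n = num_obj + num_txt
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < num_obj:
                # Object tokens are visible to every query.
                mask[q][k] = True
            elif q >= num_obj and k <= q:
                # Text queries see earlier text and themselves, never the future.
                mask[q][k] = True
    return mask
```

Under this assumption, switching objectives during joint training amounts to swapping the mask passed to the attention layers, which is one reason a single generic architecture can serve both tasks.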

Dave Zhenyu Chen, Ronghang Hu, Xinlei Chen, Matthias Nießner, Angel X. Chang • 2022

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| 3D Dense Captioning | ScanRefer (val) | CIDEr 66.19 | 91 |
| 3D Dense Captioning | Scan2Cap (val) | CIDEr (@0.5) 0.467 | 33 |
| Visual Grounding | ScanRefer v1 (val) | Acc@0.5 (All) 39.1 | 30 |
| 3D Visual Grounding | ScanRefer (test) | Unique Accuracy 73.1 | 21 |
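
The grounding results above are reported as accuracy at an IoU threshold (for example Acc@0.5). As a rough sketch of how such a metric is typically computed, the code below assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax) and exactly one predicted box per query. The helper names are hypothetical, and this is not the benchmark's official evaluation code.

```python
# Illustrative only: accuracy at an IoU threshold for 3D axis-aligned boxes.
# Box format (assumed): (xmin, ymin, zmin, xmax, ymax, zmax).

def box_volume(box):
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    return (max(0.0, box[3] - box[0])
            * max(0.0, box[4] - box[1])
            * max(0.0, box[5] - box[2]))


def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        inter *= max(0.0, hi - lo)  # zero if the boxes do not overlap on this axis
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of queries whose predicted box reaches the IoU threshold.
    Assumption: a hit is counted when IoU >= threshold."""
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```

For instance, with two queries where one prediction matches its ground-truth box exactly and the other overlaps it with an IoU of 1/3, `acc_at_iou` at the default threshold gives 0.5.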
