
End-to-End 3D Dense Captioning with Vote2Cap-DETR

About

3D dense captioning aims to generate multiple captions, each localized to its associated object region. Existing methods follow a sophisticated "detect-then-describe" pipeline equipped with numerous hand-crafted components. However, these hand-crafted components yield suboptimal performance given the cluttered spatial and class distributions of objects across different scenes. In this paper, we propose Vote2Cap-DETR, a simple yet effective transformer framework based on the recently popular DEtection TRansformer (DETR). Compared with prior art, our framework has several appealing advantages: 1) Without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote-query-driven object decoder and a caption decoder that produces the dense captions in a set-prediction manner. 2) In contrast to the two-stage scheme, our method performs detection and captioning in one stage. 3) Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate that Vote2Cap-DETR surpasses the current state of the art by 11.13% and 7.11% in CIDEr@0.5IoU, respectively. Code will be released soon.
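
The CIDEr@0.5IoU figures quoted above combine caption quality with localization quality. As a rough illustration, the sketch below follows the m@kIoU convention commonly used in 3D dense captioning benchmarks: a caption's score counts only if its predicted box overlaps the ground-truth box enough, and the sum is averaged over all annotated objects. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the function names, the axis-aligned box format, the use of `>=` at the threshold, and the availability of a precomputed per-object caption score are all hypothetical (real CIDEr also relies on corpus-level statistics that are not modeled here).

```python
from typing import Sequence, Tuple

# Assumed box format: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union


def metric_at_iou(
    caption_scores: Sequence[float],
    pred_boxes: Sequence[Box],
    gt_boxes: Sequence[Box],
    threshold: float = 0.5,
) -> float:
    """Average caption scores over all ground-truth objects, counting a
    score only when the matched predicted box reaches the IoU threshold.

    Assumes the three sequences are aligned: entry i of each refers to
    the same annotated object.
    """
    if not gt_boxes:
        return 0.0
    total = 0.0
    for score, pred, gt in zip(caption_scores, pred_boxes, gt_boxes):
        if box_iou_3d(pred, gt) >= threshold:
            total += score
    return total / len(gt_boxes)


# Hypothetical example: two objects, the second one poorly localized.
gt = [(0.0, 0.0, 0.0, 1.0, 1.0, 1.0), (0.5, 0.0, 0.0, 1.5, 1.0, 1.0)]
pred = [(0.0, 0.0, 0.0, 1.0, 1.0, 1.0), (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)]
scores = [0.8, 0.6]
result_at_050 = metric_at_iou(scores, pred, gt, threshold=0.5)
result_at_025 = metric_at_iou(scores, pred, gt, threshold=0.25)
```

In this toy case the second prediction has an IoU of 1/3 with its ground truth, so its caption score is dropped at the 0.5 threshold but kept at 0.25. That is why a stricter threshold can only lower or preserve the reported number.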

Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang Yu • 2023

Related benchmarks

Task                 Dataset                       Metric             Result  Rank
3D Dense Captioning  ScanRefer (val)               CIDEr              72.79   91
3D Dense Captioning  Scan2Cap (val)                CIDEr (@0.5)       61.81   33
3D Dense Captioning  ScanRefer (test)              CIDEr              86.28   30
3D Dense Captioning  Nr3D 1 (val)                  CIDEr (IoU=0.5)    43.84   22
3D Dense Captioning  ReferIt3D Nr3D (test)         C Score (0.5 IoU)  45.53   13
3D Dense Captioning  Nr3D (test)                   C Score @ 0.5 IoU  45.53   13
3D Dense Captioning  Nr3D 1 (test)                 CIDEr              43.84   7
3D Dense Captioning  TOD3Cap Zero-shot OOD (test)  C @ IoU 0.25       49.8    6
3D Dense Captioning  TOD3Cap In-domain (test)      C (IoU=0.25)       72.8    4
