
X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

About

3D dense captioning aims to describe individual objects in 3D scenes in natural language, where the scenes are usually represented as RGB-D scans or point clouds. However, because they exploit only a single modality, e.g., point clouds, previous approaches fail to produce faithful descriptions. Though aggregating 2D features into point clouds may be beneficial, it introduces an extra computational burden, especially at inference time. In this study, we investigate cross-modal knowledge transfer using a Transformer for 3D dense captioning, X-Trans2Cap, which boosts the performance of single-modal 3D captioning through knowledge distillation in a teacher-student framework. In practice, during the training phase, the teacher network exploits the auxiliary 2D modality and guides the student network, which takes only point clouds as input, through feature consistency constraints. Owing to the well-designed cross-modal feature fusion module and the feature alignment in the training phase, X-Trans2Cap readily acquires the rich appearance information embedded in 2D images. Thus, a more faithful caption can be generated using only point clouds during inference. Qualitative and quantitative results confirm that X-Trans2Cap outperforms the previous state of the art by a large margin, i.e., about +21 and about +16 absolute CIDEr points on the ScanRefer and Nr3D datasets, respectively.
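
The teacher-student training described above can be illustrated with a minimal sketch. It assumes the feature consistency constraint is a mean squared error between per-object student and teacher features, added to the student's captioning loss with a weight. The abstract does not state the exact loss or weighting, so the MSE form, the function names, and the `weight` parameter are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence


def feature_consistency_loss(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
) -> float:
    """Mean squared error between student and teacher object features.

    Assumption: the consistency constraint is an MSE; the abstract only
    says "feature consistency constraints" without naming the loss.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher must describe the same objects")
    total = 0.0
    count = 0
    for s_vec, t_vec in zip(student, teacher):
        if len(s_vec) != len(t_vec):
            raise ValueError("feature dimensions must match")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no features given")
    return total / count


def total_student_loss(
    caption_loss: float,
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
    weight: float = 1.0,  # hypothetical balancing weight
) -> float:
    """Captioning loss plus the weighted consistency term (training only)."""
    return caption_loss + weight * feature_consistency_loss(student, teacher)


# Teacher features come from 2D + 3D input, student features from point clouds only.
student_feats = [[1.0, 2.0]]
teacher_feats = [[1.0, 4.0]]
print(total_student_loss(1.5, student_feats, teacher_feats, weight=0.5))
```

At inference only the student is run, so the consistency term and the 2D input are not needed, which matches the abstract's claim that captions are generated from point clouds alone.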

Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Zhen Li, Shuguang Cui • 2022

Related benchmarks

Task                        Dataset                 Metric              Result  Rank
3D Dense Captioning         ScanRefer (val)         CIDEr               106.1   91
3D Dense Captioning         Scan2Cap (val)          CIDEr (@0.5)        43.87   33
3D Dense Captioning         ScanRefer (test)        CIDEr               58.81   30
3D Dense Captioning         Nr3D 1 (val)            CIDEr (IoU=0.5)     33.62   22
3D Dense Captioning         ScanRefer               CIDEr@0.5IoU        43.87   16
3D Dense Captioning         ReferIt3D Nr3D (test)   C Score (0.5 IoU)   33.62   13
3D Dense Captioning         Nr3D (test)             C Score @ 0.5 IoU   33.62   13
Oracle 3D Dense Captioning  Nr3D (val)              CIDEr               85.4    10
3D Object Detection         ScanRefer (test)        mAP@0.5             35.31   10
3D Dense Captioning         Nr3D                    C Score (0.5 IoU)   33.62   9
Showing 10 of 11 rows
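
Several rows report a caption metric at an IoU threshold (for example CIDEr@0.5IoU). In dense captioning benchmarks this usually means a caption's score counts only when the predicted box overlaps its ground-truth box with IoU of at least the threshold, and the sum is averaged over all objects. The sketch below illustrates that idea under simplifying assumptions: boxes are axis-aligned, and per-object caption scores are given as plain numbers, whereas real CIDEr is computed over a corpus. The names `box_iou_3d` and `score_at_iou` are hypothetical.

```python
from typing import Sequence, Tuple

# Axis-aligned box: (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    return inter / (box_volume(a) + box_volume(b) - inter)


def score_at_iou(
    scores: Sequence[float],
    pred_boxes: Sequence[Box],
    gt_boxes: Sequence[Box],
    k: float = 0.5,
) -> float:
    """Average caption score, zeroing objects whose box IoU is below k."""
    if not scores or not (len(scores) == len(pred_boxes) == len(gt_boxes)):
        raise ValueError("inputs must be non-empty and of equal length")
    kept = sum(
        s for s, p, g in zip(scores, pred_boxes, gt_boxes)
        if box_iou_3d(p, g) >= k
    )
    return kept / len(scores)


# First object is localized perfectly; second overlaps too little to count.
unit = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
pred = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
gt = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
print(score_at_iou([1.0, 0.8], [unit, pred], [unit, gt], k=0.5))
```

This is why the thresholded results in the table are lower than the plain CIDEr figures: poorly localized objects contribute zero regardless of caption quality.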
