
Multimodal Token Fusion for Vision Transformers

About

Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources like images. Intuitively, feeding multiple modalities of data to vision transformers could improve performance, yet the inner-modal attentive weights may also be diluted, which could undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of the inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact. Extensive experiments are conducted on a variety of homogeneous and heterogeneous modalities and demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point cloud and images. Our code is available at https://github.com/yikaiw/TokenFusion.
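The substitution step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). It assumes two modalities whose tokens are already aligned by position, takes per-token importance scores as given inputs, and uses a fixed threshold. The function name `fuse_tokens`, the `threshold` parameter, and the projection callables are hypothetical names chosen for illustration. Aggregation over more than two modalities and residual positional alignment are not shown.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]


def fuse_tokens(
    tokens_a: Sequence[Token],
    tokens_b: Sequence[Token],
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    project_b_to_a: Callable[[Token], Token],
    project_a_to_b: Callable[[Token], Token],
    threshold: float = 0.02,  # assumed value, not taken from the paper
) -> Tuple[List[Token], List[Token]]:
    """Replace low-scoring tokens with the projected token of the other modality.

    Assumes token i of modality A and token i of modality B describe the
    same spatial position, so substitution is done index by index.
    """
    if not (len(tokens_a) == len(tokens_b) == len(scores_a) == len(scores_b)):
        raise ValueError("both modalities need the same number of tokens and scores")

    fused_a: List[Token] = []
    fused_b: List[Token] = []
    for tok_a, tok_b, s_a, s_b in zip(tokens_a, tokens_b, scores_a, scores_b):
        # A token scoring below the threshold is treated as uninformative
        # and swapped for the projection of its counterpart.
        fused_a.append(list(tok_a) if s_a >= threshold else project_b_to_a(tok_b))
        fused_b.append(list(tok_b) if s_b >= threshold else project_a_to_b(tok_a))
    return fused_a, fused_b


if __name__ == "__main__":
    rgb = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    depth = [[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]
    identity = lambda t: list(t)  # stand-in for a learned projection

    fused_rgb, fused_depth = fuse_tokens(
        rgb, depth, [0.9, 0.0, 0.5], [0.0, 0.8, 0.7], identity, identity, threshold=0.1
    )
    print(fused_rgb)    # [[1.0, 1.0], [20.0, 20.0], [3.0, 3.0]]
    print(fused_depth)  # [[1.0, 1.0], [20.0, 20.0], [30.0, 30.0]]
```

In this toy run the second RGB token and the first depth token fall below the threshold, so each is replaced by its counterpart from the other modality while every other token is kept unchanged.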

Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, Yunhe Wang • 2022

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D Object Detection | ScanNet V2 (val) | mAP@0.25: 70.8 | 352 |
| Semantic segmentation | NYU v2 (test) | mIoU: 54.2 | 248 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1: 3.5 | 234 |
| Semantic segmentation | SUN RGB-D (test) | mIoU: 51.8 | 191 |
| Semantic segmentation | NYUD v2 (test) | mIoU: 55.1 | 187 |
| Semantic segmentation | NYU Depth V2 (test) | mIoU: 54.2 | 172 |
| 3D Object Detection | SUN RGB-D (val) | mAP@0.25: 64.9 | 158 |
| Semantic segmentation | NYUD v2 | mIoU: 54.2 | 96 |
| 3D Object Detection | SUN RGB-D v1 (val) | mAP@0.25: 64.9 | 81 |
| Semantic segmentation | SUN-RGBD (test) | mIoU: 51.4 | 77 |
Showing 10 of 48 rows
