Multimodal Token Fusion for Vision Transformers
About
Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources like images. Intuitively, feeding multiple modalities of data to vision transformers could improve performance, yet the inner-modal attentive weights may also be diluted, which could in turn undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of the inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact. Extensive experiments on a variety of homogeneous and heterogeneous modalities demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point cloud and images. Our code is available at https://github.com/yikaiw/TokenFusion.
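The substitution step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that); it only shows the general idea of replacing low-scoring tokens of one modality with projected tokens of another at the same positions. The function name `fuse_tokens`, the precomputed `score_a` array, the linear projection matrix `proj_b_to_a`, and the default `threshold` value are all assumptions made for illustration. Residual positional alignment is not covered.

```python
import numpy as np

def fuse_tokens(x_a, x_b, score_a, proj_b_to_a, threshold=0.5):
    """Hypothetical sketch of token substitution between two modalities.

    x_a:         (N, D_a) tokens of modality A
    x_b:         (N, D_b) tokens of modality B, aligned position by position with x_a
    score_a:     (N,) per-token importance scores for modality A (assumed given)
    proj_b_to_a: (D_b, D_a) linear projection from B's feature space to A's (assumed)
    threshold:   tokens scoring below this are treated as uninformative (arbitrary value)
    """
    # Boolean mask of tokens to keep, shaped (N, 1) so it broadcasts over features.
    keep = (score_a >= threshold)[:, None]
    # Project modality-B tokens into modality A's feature space.
    projected = x_b @ proj_b_to_a
    # Keep informative A tokens; substitute the rest with projected B tokens.
    return np.where(keep, x_a, projected)
```

In this sketch the scores are passed in directly; in a trained model they would come from a learned scoring module, which is outside the scope of the example.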
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 3D Object Detection | ScanNet V2 (val) | mAP@0.25 | 70.8 | 352 |
| Semantic segmentation | NYU v2 (test) | mIoU | 54.2 | 248 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1 | 3.5 | 234 |
| Semantic segmentation | SUN RGB-D (test) | mIoU | 51.8 | 191 |
| Semantic segmentation | NYUD v2 (test) | mIoU | 55.1 | 187 |
| Semantic segmentation | NYU Depth V2 (test) | mIoU | 54.2 | 172 |
| 3D Object Detection | SUN RGB-D (val) | mAP@0.25 | 64.9 | 158 |
| Semantic segmentation | NYUD v2 | mIoU | 54.2 | 96 |
| 3D Object Detection | SUN RGB-D v1 (val) | mAP@0.25 | 64.9 | 81 |
| Semantic segmentation | SUN-RGBD (test) | mIoU | 51.4 | 77 |