MulT: An End-to-End Multitask Learning Transformer

About

We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks, including depth estimation, semantic segmentation, reshading, surface normal estimation, 2D keypoint detection, and edge detection. Based on the Swin transformer model, our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads. At the heart of our approach is a shared attention mechanism modeling the dependencies across the tasks. We evaluate our model on several multitask benchmarks, showing that our MulT framework outperforms both the state-of-the-art multitask convolutional neural network models and all the respective single-task transformer models. Our experiments further highlight the benefits of sharing attention across all the tasks, and demonstrate that our MulT model is robust and generalizes well to new domains. Our project website is at https://ivrl.github.io/MulT/.

Deblina Bhattacharjee, Tong Zhang, Sabine Süsstrunk, Mathieu Salzmann • 2022

Related benchmarks

Task | Dataset | Result | Rank
Semantic segmentation | SYNTHIA-to-Cityscapes (SYN2CS) 16 classes (val) | -- | 50
Semantic segmentation | VKITTI2 -> Cityscapes 8 classes | mIoU 66.12 | 19
Depth Estimation | SYNTHIA to Cityscapes (val) | RMSE 9.55 | 12
Depth Estimation | Virtual KITTI to Cityscapes 2 (val) | RMSE 10.35 | 12
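For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of the textbook definitions of RMSE (lower is better) and mIoU (higher is better). It is not the paper's evaluation code: the function names `rmse` and `mean_iou`, the flat-list inputs, and the choice to skip classes absent from both prediction and ground truth are illustrative assumptions, and the benchmarks' exact protocols (depth units, valid-pixel masks, class lists) are not given on this page.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences of values."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and of equal length")
    squared_error = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared_error / len(pred))


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over per-pixel class labels.

    Assumption: classes that appear in neither pred nor target are skipped
    rather than counted as 0 or 1.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must be of equal length")
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy inputs (hypothetical, not benchmark data).
depth_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
seg_score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

`mean_iou` returns a fraction in [0, 1]; the mIoU value in the table appears to be expressed as a percentage, which would correspond to multiplying this fraction by 100.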
