SwinMTL: A Shared Architecture for Simultaneous Depth Estimation and Semantic Segmentation from Monocular Camera Images

About

This research paper presents an innovative multi-task learning framework that allows concurrent depth estimation and semantic segmentation using a single camera. The proposed approach is based on a shared encoder-decoder architecture, which integrates various techniques to improve the accuracy of the depth estimation and semantic segmentation tasks without compromising computational efficiency. Additionally, the paper incorporates an adversarial training component, employing a Wasserstein GAN framework with a critic network, to refine the model's predictions. The framework is thoroughly evaluated on two datasets, the outdoor Cityscapes dataset and the indoor NYU Depth V2 dataset, and it outperforms existing state-of-the-art methods in both segmentation and depth estimation tasks. We also conducted ablation studies to analyze the contributions of different components, including pre-training strategies, the inclusion of critics, the use of logarithmic depth scaling, and advanced image augmentations, to provide a better understanding of the proposed framework. The accompanying source code is accessible at https://github.com/PardisTaghavi/SwinMTL.
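Two of the components named in the abstract, logarithmic depth scaling and the Wasserstein critic, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `eps` clamp, and the use of plain Python lists of critic scores are assumptions made for illustration, and the losses follow the standard Wasserstein GAN formulation rather than anything stated on this page.

```python
import math

def log_scale_depth(depth_m, eps=1e-6):
    """Map a metric depth value to log space (hypothetical helper).

    eps is an assumed clamp that avoids log(0) for invalid pixels.
    """
    return math.log(max(depth_m, eps))

def inverse_log_scale(log_depth):
    """Map a log-space prediction back to metric depth."""
    return math.exp(log_depth)

def critic_loss(real_scores, fake_scores):
    """Standard WGAN critic objective: E[f(fake)] - E[f(real)].

    real_scores: critic outputs on ground-truth maps.
    fake_scores: critic outputs on the model's predictions.
    """
    mean_fake = sum(fake_scores) / len(fake_scores)
    mean_real = sum(real_scores) / len(real_scores)
    return mean_fake - mean_real

def generator_adv_loss(fake_scores):
    """Standard WGAN generator term: -E[f(fake)]."""
    return -sum(fake_scores) / len(fake_scores)
```

In a setup like the one described, the generator term would presumably be added to the task losses with some weighting; that weighting is not given here.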

Pardis Taghavi, Reza Langari, Gaurav Pandey • 2024

Related benchmarks

Task	Dataset	Result
Semantic segmentation	Cityscapes (test)	mIoU 76.41	1252
Semantic segmentation	Cityscapes	mIoU 85.04	494
Depth Estimation	NYU v2 (test)	--	435
Semantic segmentation	NYU Depth V2 (test)	mIoU 58.14	183
Depth Estimation	NYU V2	RMSE 0.5179	167
Semantic segmentation	NYUD v2	mIoU 58.14	150
Semantic segmentation	NYU V2	mIoU 58.1	74
Depth Estimation	Cityscapes	--	65
Semantic segmentation	VOC 2012	mIoU 76.41	55
Depth Prediction	Cityscapes (test)	RMSE 5.481	52
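
For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of how mIoU and RMSE are commonly computed on flattened label and depth arrays. The function names are hypothetical, and the sketch assumes that classes absent from both prediction and target are skipped; benchmark toolkits typically also ignore a void label and accumulate counts over the whole dataset, which is not shown here. The table reports mIoU as a percentage, while this sketch returns a fraction.

```python
import math

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that never occur
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")

def rmse(pred, target):
    """Root mean squared error between predicted and true depths."""
    sq_err = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(sq_err / len(pred))
```

Higher mIoU is better for segmentation, while lower RMSE is better for depth.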