InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding

About

Multi-task scene understanding aims to design a single versatile model that simultaneously predicts several scene understanding tasks. Previous studies typically process multi-task features locally and therefore cannot effectively learn spatially global and cross-task interactions, which limits the model's ability to exploit the consistency among tasks in multi-task learning. To tackle this problem, we propose an Inverted Pyramid multi-task Transformer, capable of modeling cross-task interaction among the spatial features of different tasks in a global context. Specifically, we first use a transformer encoder to capture task-generic features for all tasks. We then design a transformer decoder to establish spatial and cross-task interaction globally, and devise a novel UP-Transformer block that gradually increases the resolution of the multi-task features and establishes cross-task interaction at different scales. Furthermore, two types of Cross-Scale Self-Attention modules, Fusion Attention and Selective Attention, are proposed to efficiently facilitate cross-task interaction across feature scales. An Encoder Feature Aggregation strategy is further introduced to better model multi-scale information in the decoder. Comprehensive experiments on several 2D/3D multi-task benchmarks demonstrate the effectiveness of our proposal and establish state-of-the-art performance.
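The idea of global cross-task interaction described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, no positional encoding, no upsampling, and random projection weights, and the names `cross_task_self_attention`, `w_q`, `w_k`, and `w_v` are hypothetical. It only shows how concatenating the spatial tokens of all tasks lets every token attend to every position of every task.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_task_self_attention(task_feats, w_q, w_k, w_v):
    """Joint self-attention over the tokens of all tasks (illustrative only).

    task_feats: list of T arrays, each of shape (H*W, C), one per task.
    Returns a list of T arrays with the same shapes, plus the attention map.
    """
    # Concatenate task tokens so attention spans all tasks and all positions.
    tokens = np.concatenate(task_feats, axis=0)           # (T*H*W, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))        # (T*H*W, T*H*W)
    out = attn @ v
    # Split back into per-task feature maps (all tasks share one resolution).
    return np.split(out, len(task_feats), axis=0), attn

rng = np.random.default_rng(0)
T, H, W, C = 3, 4, 4, 8  # assumed toy sizes
feats = [rng.standard_normal((H * W, C)) for _ in range(T)]
w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
outs, attn = cross_task_self_attention(feats, w_q, w_k, w_v)
```

The attention map grows quadratically with the number of tokens, which is one reason a design might start at low resolution and raise it gradually, as the inverted-pyramid decoder is described as doing.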

Hanrong Ye, Dan Xu • 2023

Related benchmarks

Task	Dataset	Result
Surface Normal Estimation	NYU v2 (test)	--	224
Depth Estimation	NYU Depth V2	RMSE 0.5096	209
Depth Estimation	NYU V2	RMSE 0.5096	167
Semantic Segmentation	NYUD v2	mIoU 53.85	150
Saliency Detection	Pascal Context (test)	maxF 84.74	57
Surface Normal Estimation	Pascal Context (test)	mErr 13.73	50
Surface Normal Estimation	Pascal Context	Mean Error (MAE) 13.73	45
Saliency Detection	Pascal Context	maxF Score 84.74	45
Semantic Segmentation	Pascal Context	mIoU 80.22	42
Surface Normal Estimation	NYUD	mErr 18.67	38

Showing 10 of 16 rows
