InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding
About
Multi-task scene understanding aims to build a single versatile model that handles several scene understanding tasks simultaneously. Previous studies typically process multi-task features locally, so they cannot effectively learn spatially global and cross-task interactions, which limits how well the models exploit the consistency among tasks in multi-task learning. To tackle this problem, we propose an Inverted Pyramid multi-task Transformer (InvPT++) that models cross-task interaction among the spatial features of different tasks in a global context. Specifically, we first use a transformer encoder to capture task-generic features shared by all tasks. We then design a transformer decoder that establishes spatial and cross-task interaction globally; within it, a novel UP-Transformer block gradually increases the resolution of the multi-task features and establishes cross-task interaction at different scales. Furthermore, two types of Cross-Scale Self-Attention modules, Fusion Attention and Selective Attention, are proposed to efficiently facilitate cross-task interaction across different feature scales. An Encoder Feature Aggregation strategy is also introduced to better model multi-scale information in the decoder. Comprehensive experiments on several 2D/3D multi-task benchmarks demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance.
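The decoder idea in the abstract (global attention over the spatial tokens of all tasks jointly, repeated at increasing resolution) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`cross_task_self_attention`, `upsample2x`, `up_stage`), the single-head attention, the nearest-neighbour upsampling, and the upsample-then-attend ordering are all assumptions made for illustration. Fusion Attention, Selective Attention, and Encoder Feature Aggregation are not modeled, because the abstract does not specify their mechanics.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_task_self_attention(task_feats, w_q, w_k, w_v):
    """Single-head self-attention over all tasks and all positions at once.

    task_feats: array of shape (T, H, W, C), one feature map per task.
    w_q, w_k, w_v: (C, C) projection matrices (hypothetical parameters).
    """
    T, H, W, C = task_feats.shape
    # Flatten tasks and spatial positions into one joint token sequence,
    # so every token can attend to every position of every task.
    tokens = task_feats.reshape(T * H * W, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    out = tokens + attn @ v  # residual connection
    return out.reshape(T, H, W, C)


def upsample2x(task_feats):
    """Nearest-neighbour 2x upsampling (an assumption; a stand-in only)."""
    return task_feats.repeat(2, axis=1).repeat(2, axis=2)


def up_stage(task_feats, w_q, w_k, w_v):
    """One illustrative stage: raise resolution, then mix across tasks."""
    return cross_task_self_attention(upsample2x(task_feats), w_q, w_k, w_v)


# Toy usage: 2 tasks, 4x4 feature maps, 8 channels, two stages.
rng = np.random.default_rng(0)
C = 8
weights = [rng.standard_normal((C, C)) * 0.1 for _ in range(3)]
feats = rng.standard_normal((2, 4, 4, C))
for _ in range(2):
    feats = up_stage(feats, *weights)
print(feats.shape)
```

Because the tokens of all tasks share one attention matrix, changing the features of one task changes the output of the others, which is the cross-task interaction the abstract refers to. The joint sequence grows with both the number of tasks and the resolution, so the attention matrix grows quadratically; this cost is presumably what the Cross-Scale Self-Attention modules are meant to reduce.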
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Surface Normal Estimation | NYU v2 (test) | -- | -- | 206 |
| Depth Estimation | NYU Depth V2 | RMSE | 0.5096 | 177 |
| Semantic segmentation | NYUD v2 | mIoU | 53.85 | 96 |
| Saliency Detection | Pascal Context (test) | maxF | 84.74 | 57 |
| Surface Normal Estimation | Pascal Context (test) | mErr | 13.73 | 50 |
| Boundary Detection | Pascal Context (test) | ODSF | 74.2 | 34 |
| Human Part Parsing | Pascal Context (test) | mIoU | 69.12 | 20 |
| Boundary Detection | NYUD v2 | ODS F-measure | 78.1 | 17 |