MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders
About
Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependencies and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: the self-task Mamba (STM) block and the cross-task Mamba (CTM) block. STM handles long-range dependencies by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on the NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best methods on semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at https://github.com/EnVision-Research/MTMamba.
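
To make the idea of information exchange across tasks concrete, the following toy sketch blends each task's feature vector with a feature shared by all tasks. It is not the MTMamba implementation: the function `cross_task_exchange`, the `gate_logits` parameter, and the plain average used as the shared feature are all assumptions made for illustration, and no Mamba layer is involved.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cross_task_exchange(task_feats, gate_logits):
    """Blend each task's feature vector with a feature shared across tasks.

    task_feats:  dict of task name -> list of floats (same length per task)
    gate_logits: dict of task name -> float; sigmoid(logit) is the weight kept
                 on the task's own feature (hypothetical; it would be learned)
    """
    names = list(task_feats)
    dim = len(task_feats[names[0]])
    # Assumption: a plain average over tasks stands in for whatever shared
    # representation a real cross-task block would compute.
    shared = [sum(task_feats[n][i] for n in names) / len(names) for i in range(dim)]
    fused = {}
    for n in names:
        g = sigmoid(gate_logits[n])
        # Convex combination of the task-specific and the shared feature.
        fused[n] = [g * f + (1.0 - g) * s for f, s in zip(task_feats[n], shared)]
    return fused

feats = {"semseg": [1.0, 0.0], "depth": [0.0, 1.0]}
fused = cross_task_exchange(feats, {"semseg": 0.0, "depth": 0.0})
```

With both gate logits at zero, each task keeps half of its own feature and takes half of the shared one, so the two task features move toward each other. A large positive logit would leave a task's feature nearly unchanged.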
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | Cityscapes | mIoU 90.77 | 494 |
| Surface Normal Estimation | NYU v2 (test) | -- | 224 |
| Depth Estimation | NYU Depth V2 | RMSE 0.5066 | 209 |
| Depth Estimation | NYU V2 | RMSE 0.5066 | 167 |
| Semantic segmentation | NYUD v2 | mIoU 55.82 | 150 |
| Depth Estimation | NYU v2 (val) | RMSE 0.5066 | 65 |
| Saliency Detection | Pascal Context (test) | maxF 84.14 | 57 |
| Surface Normal Estimation | Pascal Context (test) | mErr 14.14 | 50 |
| Surface Normal Estimation | Pascal Context | Mean Error (MAE) 14.14 | 45 |
| Saliency Detection | Pascal Context | maxF Score 84.14 | 45 |
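
Two of the metrics in the table, mIoU and RMSE, have standard definitions that can be sketched in a few lines. The helper names `mean_iou` and `rmse` and the toy inputs are illustrative assumptions. Benchmark code typically accumulates a confusion matrix over the whole dataset and reports mIoU as a percentage, which this per-sample sketch does not do.

```python
import math

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes that occur in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

def rmse(pred, target):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

# Toy flattened label maps: class 0 has IoU 1/2, class 1 has IoU 2/3.
labels_pred = [0, 0, 1, 1]
labels_true = [0, 1, 1, 1]
miou = mean_iou(labels_pred, labels_true, num_classes=2)

# Toy depth values: one of three pixels is off by 2.0.
depth_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Higher is better for mIoU, while lower is better for RMSE and for the mean angular error reported for surface normals.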