MTMamba++: Enhancing Multi-Task Dense Scene Understanding via Mamba-Based Decoders
About
Multi-task dense scene understanding, which trains a single model for multiple dense prediction tasks, has a wide range of application scenarios. Capturing long-range dependencies and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba++, a novel architecture for multi-task scene understanding featuring a Mamba-based decoder. It contains two types of core blocks: the self-task Mamba (STM) block and the cross-task Mamba (CTM) block. STM handles long-range dependencies by leveraging state-space models, while CTM explicitly models task interactions to facilitate information exchange across tasks. We design two types of CTM blocks, namely F-CTM and S-CTM, to enhance cross-task interaction from the feature and semantic perspectives, respectively. Extensive experiments on the NYUDv2, PASCAL-Context, and Cityscapes datasets demonstrate the superior performance of MTMamba++ over CNN-based, Transformer-based, and diffusion-based methods while maintaining high computational efficiency. The code is available at https://github.com/EnVision-Research/MTMamba.
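To make the two ideas in the abstract concrete, here is a minimal, illustrative sketch in plain Python. It is not the MTMamba++ implementation: `ssm_scan` shows the generic linear state-space recurrence that lets a state carry information across a long sequence, and `gated_cross_task_fusion` is a hypothetical stand-in for cross-task information exchange. The function names, the scalar state, and the fixed scalar `gate` are all assumptions made for illustration; the actual blocks are learned and operate on 2-D feature maps.

```python
from typing import List


def ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Run a scalar linear state-space recurrence over a 1-D sequence.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    """
    h = 0.0
    ys = []
    for x_t in x:
        h = a * h + b * x_t  # state update: old state decays by a, input enters via b
        ys.append(c * h)     # readout of the current state
    return ys


def gated_cross_task_fusion(
    task_feats: List[List[float]], gate: float
) -> List[List[float]]:
    """Blend each task's features with the mean over all tasks.

    Hypothetical illustration only: `gate` is a fixed scalar in [0, 1],
    whereas a real model would learn its gating.
    """
    n_tasks = len(task_feats)
    length = len(task_feats[0])
    # Shared representation: element-wise mean over tasks.
    shared = [sum(f[i] for f in task_feats) / n_tasks for i in range(length)]
    # Each task keeps `gate` of its own feature and takes the rest from `shared`.
    return [
        [gate * f[i] + (1.0 - gate) * shared[i] for i in range(length)]
        for f in task_feats
    ]


# An impulse at t=0 still influences later outputs through the state.
impulse_response = ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0)

# Two tasks exchange information through the shared mean.
fused = gated_cross_task_fusion([[1.0, 3.0], [3.0, 1.0]], gate=0.5)
```

In the scan, the first input keeps affecting every later output with a weight that decays by `a` per step, which is the basic mechanism behind modeling long-range dependencies with state-space models. In the fusion, each task's features move toward the shared mean, so information from one task reaches the others.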
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Surface Normal Estimation | NYU v2 (test) | -- | -- | 206 |
| Depth Estimation | NYU Depth V2 | RMSE | 0.4818 | 177 |
| Semantic segmentation | NYUD v2 | mIoU | 57.01 | 96 |
| Saliency Detection | Pascal Context (test) | maxF | 85.56 | 57 |
| Surface Normal Estimation | Pascal Context (test) | mErr | 14.29 | 50 |
| Boundary Detection | Pascal Context (test) | ODSF | 78.6 | 34 |
| Human Part Parsing | Pascal Context (test) | mIoU | 72.87 | 20 |
| Boundary Detection | NYUD v2 | ODS F-measure | 79.4 | 17 |
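
For readers unfamiliar with two of the metrics in the table, the sketch below computes RMSE (lower is better) and mIoU (higher is better) under their conventional definitions, using flat lists of per-pixel values. It is an illustration only: the function names are hypothetical, and the exact evaluation protocol behind the numbers above (valid-pixel masks, ignored labels, dataset-level accumulation, percentage scaling) is not specified here and may differ.

```python
import math
from typing import List


def rmse(pred: List[float], target: List[float]) -> float:
    """Root mean squared error between predicted and ground-truth values."""
    squared_errors = [(p - t) ** 2 for p, t in zip(pred, target)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mean_iou(pred: List[int], target: List[int], num_classes: int) -> float:
    """Mean intersection-over-union across classes.

    Classes absent from both prediction and ground truth are skipped
    (an assumption; benchmarks may handle this case differently).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)


depth_error = rmse([1.0, 2.0], [1.0, 4.0])
seg_score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Here `mean_iou` returns a fraction in [0, 1]; the mIoU values in the table appear to be reported on a 0 to 100 scale.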