M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception
About
Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.
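The abstract does not describe the internals of the Window-Based Cross-Task Attention Module, so the following is only a generic sketch of window-restricted cross-attention between two task feature maps, not the paper's implementation. The function names (`window_partition`, `window_merge`, `window_cross_task_attention`), the absence of learned query/key/value projections, the residual connection, and the window size are all assumptions made for illustration.

```python
import numpy as np


def window_partition(x, w):
    """Split an (H, W, C) feature map into non-overlapping w x w windows.

    Returns an array of shape (num_windows, w * w, C).
    """
    H, W, C = x.shape
    if H % w or W % w:
        raise ValueError("H and W must be divisible by the window size")
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)


def window_merge(windows, w, H, W):
    """Inverse of window_partition: rebuild the (H, W, C) feature map."""
    C = windows.shape[-1]
    x = windows.reshape(H // w, W // w, w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def window_cross_task_attention(target, source, w):
    """Let `target` task features attend to `source` task features,
    restricted to the same spatial window.

    Hypothetical simplification: queries, keys and values are the raw
    features (no learned projections), and the attended features are
    added back to `target` as a residual.
    """
    H, W, C = target.shape
    q = window_partition(target, w)                    # (nW, w*w, C)
    kv = window_partition(source, w)                   # (nW, w*w, C)
    scores = q @ kv.transpose(0, 2, 1) / np.sqrt(C)    # (nW, w*w, w*w)
    attended = softmax(scores, axis=-1) @ kv           # (nW, w*w, C)
    return target + window_merge(attended, w, H, W)


# Usage with random stand-ins for two task-specific feature maps.
rng = np.random.default_rng(0)
depth_feat = rng.normal(size=(8, 8, 4))
seg_feat = rng.normal(size=(8, 8, 4))
refined_depth = window_cross_task_attention(depth_feat, seg_feat, w=4)
print(refined_depth.shape)
```

Restricting attention to windows means each query compares against only `w * w` keys instead of all `H * W` positions, which is the usual reason window-based attention is cheaper than global attention.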
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic segmentation | Cityscapes (val) | mIoU | 77.6 | 527 |
| Depth Estimation | NYU V2 | RMSE | 0.4196 | 167 |
| Semantic segmentation | NYUD v2 | mIoU | 61.54 | 150 |
| Depth Estimation | NYU v2 (val) | RMSE | 0.4196 | 65 |
| Surface Normal Estimation | NYUv2 (val) | mAE | 13.81 | 19 |
| Semantic segmentation | NYUD v2 (val) | mIoU | 61.54 | 14 |
| Boundary Detection | NYUD v2 (val) | ODS F-measure | 85.27 | 11 |
| Depth Estimation | Cityscapes standard (val) | RMSE | 6.1 | 11 |
| 3D Mapping | ITC dataset (2nd Floor) | Mean Error (m) | 0.11 | 10 |
| 3D Mapping | ITC dataset (3rd Floor) | Mean Error (m) | 0.1 | 10 |
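
For readers unfamiliar with the metrics in the table, the sketch below shows the standard textbook definitions of RMSE and mIoU, plus a mean angular error for surface normals. It assumes that "mAE" in the table denotes mean angular error in degrees, which the page does not state. The function names are illustrative, and the exact evaluation protocol behind the reported numbers (valid-pixel masking, ignored classes, depth caps) is not given here and is not reproduced.

```python
import math


def rmse(pred, target):
    """Root mean squared error over paired values (e.g. per-pixel depth)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared / len(pred))


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes that occur in pred or target.

    Returns a fraction in [0, 1]; leaderboards usually report it as a percentage.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)


def mean_angular_error_deg(pred_normals, target_normals):
    """Mean angle in degrees between paired 3D normal vectors."""
    if len(pred_normals) != len(target_normals) or not pred_normals:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, t in zip(pred_normals, target_normals):
        dot = sum(a * b for a, b in zip(p, t))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
        cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
        total += math.degrees(math.acos(cos))
    return total / len(pred_normals)
```

Lower is better for RMSE and angular error, while higher is better for mIoU.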