M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception

About

Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.
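The window-based cross-task attention idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' module: it assumes a single attention head, no learned query/key/value projections, non-overlapping square windows, and a plain residual connection, and all function names are hypothetical. It only shows how features of one task can attend to features of another task inside local windows, which keeps the attention cost per window fixed rather than growing with the full image size.

```python
import numpy as np

def window_partition(x, w):
    """Split an (H, W, C) feature map into non-overlapping w x w windows.

    Returns an array of shape (num_windows, w*w, C).
    """
    H, W, C = x.shape
    assert H % w == 0 and W % w == 0, "H and W must be divisible by the window size"
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def window_merge(windows, w, H, W):
    """Inverse of window_partition: (num_windows, w*w, C) -> (H, W, C)."""
    C = windows.shape[-1]
    x = windows.reshape(H // w, W // w, w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_task_window_attention(target, source, w):
    """Refine `target` task features with `source` task features.

    Queries come from the target task; keys and values come from the
    source task. Attention is computed independently inside each window,
    and the result is added back to the target features (residual), so
    the target's own task-specific features are preserved.
    Assumption: no learned projections, single head (illustration only).
    """
    H, W, C = target.shape
    q = window_partition(target, w)           # (nW, w*w, C)
    k = window_partition(source, w)           # (nW, w*w, C)
    v = k                                     # values share the source features
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))  # (nW, w*w, w*w)
    out = attn @ v                            # (nW, w*w, C)
    return target + window_merge(out, w, H, W)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    depth_feat = rng.normal(size=(8, 8, 16))  # hypothetical depth-branch features
    seg_feat = rng.normal(size=(8, 8, 16))    # hypothetical segmentation-branch features
    refined = cross_task_window_attention(depth_feat, seg_feat, w=4)
    print(refined.shape)
```

Because attention is restricted to each window, a change in the source features of one window cannot affect the output in any other window. That locality is the design trade-off a window-based scheme makes in exchange for lower cost.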

U.V.B.L Udugama, George Vosselman, Francesco Nex • 2025

Related benchmarks

Task                       Dataset                    Metric          Result   Rank
Semantic segmentation      Cityscapes (val)           mIoU            77.6     527
Depth Estimation           NYU V2                     RMSE            0.4196   167
Semantic segmentation      NYUD v2                    mIoU            61.54    150
Depth Estimation           NYU v2 (val)               RMSE            0.4196   65
Surface Normal Estimation  NYUv2 (val)                mAE             13.81    19
Semantic segmentation      NYUD v2 (val)              mIoU            61.54    14
Boundary Detection         NYUD v2 (val)              ODS F-measure   85.27    11
Depth Estimation           Cityscapes standard (val)  RMSE            6.1      11
3D Mapping                 ITC dataset (2nd Floor)    Mean Error (m)  0.11     10
3D Mapping                 ITC dataset (3rd Floor)    Mean Error (m)  0.1      10
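
For readers unfamiliar with the two most frequent metrics in the table, the following is a minimal sketch of how RMSE (depth) and mIoU (semantic segmentation) are commonly defined. It is an illustration over flat lists of values, not the evaluation code behind the numbers above: real benchmark protocols typically add details such as ignore labels, depth range caps, and accumulation over a whole dataset, and the function names here are hypothetical.

```python
import math

def rmse(pred, target):
    """Root mean squared error between two equal-length sequences of depths."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")
    squared = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared / len(pred))

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target.

    Assumption: classes absent from both prediction and ground truth are
    skipped rather than counted as IoU 0 or 1.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must be equal in length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)

if __name__ == "__main__":
    print(rmse([1.0, 2.0], [1.0, 4.0]))
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Lower is better for RMSE, while higher is better for mIoU, so the two columns of results in the table are not directly comparable.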
