M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception

About

Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.
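The abstract names a Window-Based Cross-Task Attention Module but does not spell out its internals. The following is a minimal NumPy sketch of the general idea only, not the authors' implementation: features of one task attend to features of another task inside non-overlapping local windows, and the result is added back to the original features. The function names, the absence of learned query/key/value projections, and the residual connection are all assumptions made for illustration.

```python
import numpy as np

def window_partition(x, w):
    """Split an (H, W, C) feature map into non-overlapping w x w windows.

    Returns an array of shape (num_windows, w*w, C).
    """
    H, W, C = x.shape
    assert H % w == 0 and W % w == 0, "H and W must be divisible by w"
    x = x.reshape(H // w, w, W // w, w, C)
    x = x.transpose(0, 2, 1, 3, 4)          # (H/w, W/w, w, w, C)
    return x.reshape(-1, w * w, C)

def window_merge(windows, w, H, W):
    """Inverse of window_partition: rebuild the (H, W, C) feature map."""
    C = windows.shape[-1]
    x = windows.reshape(H // w, W // w, w, w, C)
    x = x.transpose(0, 2, 1, 3, 4)          # (H/w, w, W/w, w, C)
    return x.reshape(H, W, C)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_cross_task_attention(target, source, w):
    """Hypothetical cross-task attention restricted to local windows.

    Queries come from the target task, keys and values from the source
    task. No learned projections are used (an assumption for brevity).
    """
    H, W, C = target.shape
    q = window_partition(target, w)          # (nW, w*w, C)
    kv = window_partition(source, w)         # (nW, w*w, C)
    scores = q @ kv.transpose(0, 2, 1) / np.sqrt(C)   # (nW, w*w, w*w)
    attn = softmax(scores, axis=-1)
    mixed = attn @ kv                        # source features per window
    # Residual add keeps the target task's own features (assumed design).
    return target + window_merge(mixed, w, H, W)
```

Because attention is computed per window, the cost grows with the number of windows times the square of the window area rather than with the square of the full image area, which is the usual motivation for windowed attention on constrained hardware.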

U.V.B.L Udugama, George Vosselman, Francesco Nex • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Semantic segmentation	Cityscapes (val)	mIoU	77.6	527
Depth Estimation	NYU V2	RMSE	0.4196	167
Semantic segmentation	NYUD v2	mIoU	61.54	150
Depth Estimation	NYU v2 (val)	RMSE	0.4196	65
Surface Normal Estimation	NYUv2 (val)	mAE	13.81	19
Semantic segmentation	NYUD v2 (val)	mIoU	61.54	14
Boundary Detection	NYUD v2 (val)	ODS F-measure	85.27	11
Depth Estimation	Cityscapes standard (val)	RMSE	6.1	11
3D Mapping	ITC dataset (2nd Floor)	Mean Error (m)	0.11	10
3D Mapping	ITC dataset (3rd Floor)	Mean Error (m)	0.1	10

Showing 10 of 12 rows
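For readers unfamiliar with two of the metrics listed above, here is a minimal sketch of how RMSE (depth) and mIoU (semantic segmentation) are commonly computed. It illustrates the standard definitions only. The function names and the optional `valid` mask are assumptions, and individual benchmarks may differ in details such as ignored labels, depth caps, or whether mIoU is accumulated over the whole dataset. The mIoU values in the table appear to be percentages, whereas this sketch returns a fraction.

```python
import numpy as np

def rmse(pred, gt, valid=None):
    """Root mean squared error over valid pixels (lower is better)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)   # assume every pixel is valid
    diff = pred[valid] - gt[valid]
    return float(np.sqrt(np.mean(diff ** 2)))

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes (higher is better).

    Classes that appear in neither prediction nor ground truth are skipped.
    """
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy usage: class 0 has IoU 1/2 and class 1 has IoU 2/3.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```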
