Our new X account is live! Follow @wizwand_team for updates
WorkDL logo mark

MonSter++: Unified Stereo Matching, Multi-view Stereo, and Real-time Stereo with Monodepth Priors

About

We introduce MonSter++, a geometric foundation model for multi-view depth estimation, unifying rectified stereo matching and unrectified multi-view stereo. Both tasks fundamentally recover metric depth from correspondence search and consequently face the same dilemma: struggling to handle ill-posed regions with limited matching cues. To address this, we propose MonSter++, a novel method that integrates monocular depth priors into multi-view depth estimation, effectively combining the complementary strengths of single-view and multi-view cues. MonSter++ fuses monocular depth and multi-view depth into a dual-branched architecture. Confidence-based guidance adaptively selects reliable multi-view cues to correct scale ambiguity in monocular depth. The refined monocular predictions, in turn, effectively guide multi-view estimation in ill-posed regions. This iterative mutual enhancement enables MonSter++ to evolve coarse object-level monocular priors into fine-grained, pixel-level geometry, fully unlocking the potential of multi-view depth estimation. MonSter++ achieves new state-of-the-art on both stereo matching and multi-view stereo. By effectively incorporating monocular priors through our cascaded search and multi-scale depth fusion strategy, our real-time variant RT-MonSter++ also outperforms previous real-time methods by a large margin. As shown in Fig.1, MonSter++ achieves significant improvements over previous methods across eight benchmarks from three tasks -- stereo matching, real-time stereo matching, and multi-view stereo, demonstrating the strong generality of our framework. Besides high accuracy, MonSter++ also demonstrates superior zero-shot generalization capability. We will release both the large and the real-time models to facilitate their use by the open-source community.

Junda Cheng, Wenjing Liao, Zhipeng Cai, Longliang Liu, Gangwei Xu, Xianqi Wang, Yuzhou Wang, Zikang Yuan, Yong Deng, Jinliang Zang, Yangyang Shi, Jinhui Tang, Xin Yang• 2025

Related benchmarks

TaskDatasetResultRank
Stereo MatchingKITTI 2015
D1 Error (All)1.37
118
Stereo MatchingKITTI 2012
Error Rate (3px, Noc)0.79
81
Stereo MatchingScene Flow
EPE (px)0.37
40
Stereo MatchingETH3D
Threshold Error > 1px (All)0.45
30
Stereo MatchingDrivingStereo Zero-shot generalization
Error Rate (Sunny)2.6
15
Depth EstimationProposed Synthetic Dataset 1.0 (Evaluation set)
MAE (m)0.0862
6
Depth EstimationReal-world indoor scenes
MAE (m)0.379
6
Depth Estimation640x480 image pairs (test)
FPS3.63
5
Showing 8 of 8 rows

Other info

Follow for update