MonSter++: Unified Stereo Matching, Multi-view Stereo, and Real-time Stereo with Monodepth Priors
About
We introduce MonSter++, a geometric foundation model for multi-view depth estimation, unifying rectified stereo matching and unrectified multi-view stereo. Both tasks fundamentally recover metric depth from correspondence search and consequently face the same dilemma: struggling to handle ill-posed regions with limited matching cues. To address this, we propose MonSter++, a novel method that integrates monocular depth priors into multi-view depth estimation, effectively combining the complementary strengths of single-view and multi-view cues. MonSter++ fuses monocular depth and multi-view depth in a dual-branched architecture. Confidence-based guidance adaptively selects reliable multi-view cues to correct scale ambiguity in monocular depth. The refined monocular predictions, in turn, effectively guide multi-view estimation in ill-posed regions. This iterative mutual enhancement enables MonSter++ to evolve coarse object-level monocular priors into fine-grained, pixel-level geometry, fully unlocking the potential of multi-view depth estimation. MonSter++ achieves new state-of-the-art results on both stereo matching and multi-view stereo. By effectively incorporating monocular priors through our cascaded search and multi-scale depth fusion strategy, our real-time variant RT-MonSter++ also outperforms previous real-time methods by a large margin. As shown in Fig. 1, MonSter++ achieves significant improvements over previous methods across eight benchmarks from three tasks (stereo matching, real-time stereo matching, and multi-view stereo), demonstrating the strong generality of our framework. Besides high accuracy, MonSter++ also demonstrates superior zero-shot generalization capability. We will release both the large and the real-time models to facilitate their use by the open-source community.
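
To make the idea of confidence-guided scale correction concrete, here is a minimal sketch in plain Python. It is not the MonSter++ implementation, which is a learned, iterative network; it swaps in a closed-form, confidence-weighted least-squares fit of a global scale and shift as a stand-in. The function names (`fit_scale_shift`, `align_mono`), the `conf_threshold` parameter, and the flat-list data layout are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def fit_scale_shift(
    mono: Sequence[float],
    stereo: Sequence[float],
    conf: Sequence[float],
    conf_threshold: float = 0.5,  # hypothetical cutoff, not from the paper
) -> Tuple[float, float]:
    """Fit s, t minimizing sum_i w_i * (s * mono_i + t - stereo_i)^2.

    Pixels whose confidence falls below conf_threshold get weight 0, so only
    multi-view estimates deemed reliable influence the fit.
    """
    sw = swm = swmm = swd = swmd = 0.0
    for m, d, c in zip(mono, stereo, conf):
        w = c if c >= conf_threshold else 0.0
        sw += w
        swm += w * m
        swmm += w * m * m
        swd += w * d
        swmd += w * m * d
    # Solve the 2x2 normal equations [[swmm, swm], [swm, sw]] @ [s, t] = [swmd, swd].
    det = sw * swmm - swm * swm
    if abs(det) < 1e-12:
        raise ValueError("not enough confident, distinct samples to fit scale and shift")
    scale = (sw * swmd - swm * swd) / det
    shift = (swmm * swd - swm * swmd) / det
    return scale, shift


def align_mono(mono: Sequence[float], scale: float, shift: float) -> List[float]:
    """Apply the fitted scale and shift to every monocular value."""
    return [scale * m + shift for m in mono]


# Toy data: the relative monocular values relate to the multi-view estimates
# by scale 20 and shift 1, except for one unreliable low-confidence match.
mono = [0.1, 0.2, 0.3, 0.4, 0.5]
stereo = [3.0, 5.0, 7.0, 9.0, 100.0]
conf = [0.9, 0.8, 0.9, 0.7, 0.1]

scale, shift = fit_scale_shift(mono, stereo, conf)
aligned = align_mono(mono, scale, shift)
```

In this toy case the low-confidence outlier is ignored, so the aligned monocular values agree with the four reliable multi-view estimates and supply a plausible value at the unreliable pixel. The abstract describes the reverse direction as well, with the refined monocular prediction guiding the multi-view estimate in ill-posed regions.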
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Stereo Matching | KITTI 2015 (test) | D1 Error (Overall) | 1.37 | 205 |
| Stereo Matching | KITTI 2015 | D1 Error (All) | 1.37 | 118 |
| Stereo Matching | KITTI 2012 | Error Rate (3px, All) | 1.07 | 108 |
| Stereo Matching | KITTI 2012 (test) | -- | -- | 89 |
| Stereo Matching | ETH3D | Threshold Error > 1px (Noc) | 0.25 | 50 |
| Stereo Matching | Middlebury (test) | -- | -- | 47 |
| Stereo Matching | ETH3D (non-occluded) | Bad 1.0 Error | 2.03 | 43 |
| Stereo Matching | Scene Flow | EPE (px) | 0.37 | 40 |
| Stereo Matching | ETH3D (test) | -- | -- | 34 |
| Stereo Matching | DrivingStereo Zero-shot generalization | Error Rate (Sunny) | 2.6 | 15 |
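
For readers unfamiliar with the metrics named in the table, the sketch below shows how they are commonly computed from predicted and ground-truth disparities. It follows the usual conventions: EPE is the mean absolute disparity error, a "bad-τ" rate is the percentage of pixels whose error exceeds τ pixels, and the KITTI 2015 D1 outlier rate counts pixels whose error exceeds both 3 px and 5% of the true disparity. The function names, the flat-list inputs, and the rule that ground truth `<= 0` marks an invalid pixel are assumptions for illustration; the official benchmark tools may differ in details such as masking.

```python
from typing import List, Sequence


def _valid_errors(pred: Sequence[float], gt: Sequence[float]) -> List[tuple]:
    """Return (abs_error, gt) pairs, skipping pixels without ground truth (gt <= 0)."""
    return [(abs(p - g), g) for p, g in zip(pred, gt) if g > 0]


def epe(pred: Sequence[float], gt: Sequence[float]) -> float:
    """End-point error: mean absolute disparity error in pixels."""
    errs = _valid_errors(pred, gt)
    return sum(e for e, _ in errs) / len(errs)


def bad_pixel_rate(pred: Sequence[float], gt: Sequence[float], threshold: float = 1.0) -> float:
    """Percentage of valid pixels whose error exceeds `threshold` pixels."""
    errs = _valid_errors(pred, gt)
    return 100.0 * sum(1 for e, _ in errs if e > threshold) / len(errs)


def d1_error(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Percentage of valid pixels with error > 3 px and > 5% of the true disparity."""
    errs = _valid_errors(pred, gt)
    return 100.0 * sum(1 for e, g in errs if e > 3.0 and e > 0.05 * g) / len(errs)


# Toy example: the last pixel has no ground truth and is ignored.
pred = [10.0, 20.0, 35.0, 50.0]
gt = [10.0, 22.0, 30.0, 0.0]

epe_px = epe(pred, gt)                   # errors are 0, 2, 5 px
bad1 = bad_pixel_rate(pred, gt, 1.0)     # two of three valid pixels exceed 1 px
d1 = d1_error(pred, gt)                  # only the 5 px error counts as an outlier
```

All three functions raise `ZeroDivisionError` if no pixel has valid ground truth; a production evaluation would handle that case explicitly.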