MonSter++: Unified Stereo Matching, Multi-view Stereo, and Real-time Stereo with Monodepth Priors
About
We introduce MonSter++, a geometric foundation model for multi-view depth estimation that unifies rectified stereo matching and unrectified multi-view stereo. Both tasks recover metric depth through correspondence search and therefore share the same weakness: ill-posed regions with limited matching cues. MonSter++ addresses this by integrating monocular depth priors into multi-view depth estimation, combining the complementary strengths of single-view and multi-view cues. It fuses monocular depth and multi-view depth in a dual-branch architecture. Confidence-based guidance adaptively selects reliable multi-view cues to correct the scale ambiguity of monocular depth. The refined monocular predictions, in turn, guide multi-view estimation in ill-posed regions. This iterative mutual enhancement lets MonSter++ evolve coarse object-level monocular priors into fine-grained, pixel-level geometry. MonSter++ achieves new state-of-the-art results on both stereo matching and multi-view stereo. By incorporating monocular priors through a cascaded search and multi-scale depth fusion strategy, the real-time variant RT-MonSter++ also outperforms previous real-time methods by a large margin. As shown in Fig. 1, MonSter++ improves significantly over previous methods across eight benchmarks from three tasks (stereo matching, real-time stereo matching, and multi-view stereo), demonstrating the generality of the framework. Besides high accuracy, MonSter++ also shows superior zero-shot generalization. We will release both the large and the real-time models to facilitate their use by the open-source community.
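The idea of using reliable multi-view cues to correct scale ambiguity in monocular depth can be illustrated with a generic confidence-masked scale-and-shift fit. This is a minimal sketch, not the MonSter++ implementation: it assumes the monocular prediction is related to the stereo disparity by an affine map, that a per-pixel confidence in [0, 1] is available, and that inputs are flat lists. The function names, the `threshold` parameter, and the closed-form least-squares fit are all illustrative assumptions.

```python
def align_scale_shift(mono, stereo, conf, threshold=0.5):
    """Fit stereo ≈ s * mono + t using only confident pixels.

    mono, stereo, conf: flat sequences of equal length (hypothetical inputs).
    threshold: hypothetical confidence cutoff for "reliable" stereo pixels.
    Returns the least-squares scale s and shift t.
    """
    # Keep only pixels whose stereo estimate is considered reliable.
    pairs = [(m, d) for m, d, c in zip(mono, stereo, conf) if c > threshold]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two confident pixels")

    mean_m = sum(m for m, _ in pairs) / n
    mean_d = sum(d for _, d in pairs) / n

    # Closed-form ordinary least squares for a line with two unknowns.
    var_m = sum((m - mean_m) ** 2 for m, _ in pairs)
    if var_m == 0:
        raise ValueError("monocular values are constant; scale is undetermined")
    cov = sum((m - mean_m) * (d - mean_d) for m, d in pairs)

    s = cov / var_m
    t = mean_d - s * mean_m
    return s, t


def apply_alignment(mono, s, t):
    """Map every monocular value into the stereo disparity range."""
    return [s * m + t for m in mono]


if __name__ == "__main__":
    mono = [1.0, 2.0, 3.0, 4.0]          # relative monocular prediction
    stereo = [5.0, 7.0, 9.0, 100.0]      # last pixel is a bad match
    conf = [0.9, 0.8, 0.95, 0.1]         # low confidence flags the bad match
    s, t = align_scale_shift(mono, stereo, conf)
    print(s, t, apply_alignment(mono, s, t))
```

In this toy input the low-confidence pixel is excluded from the fit, so the aligned monocular value replaces the unreliable stereo estimate there. The abstract describes an iterative, learned version of this mutual guidance; the sketch only shows a single non-learned alignment step.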
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Stereo Matching | KITTI 2015 | D1 Error (All) | 1.37 | 118 |
| Stereo Matching | KITTI 2012 | Error Rate (3px, Noc) | 0.79 | 81 |
| Stereo Matching | Scene Flow | EPE (px) | 0.37 | 40 |
| Stereo Matching | ETH3D | Threshold Error > 1px (All) | 0.45 | 30 |
| Stereo Matching | DrivingStereo Zero-shot generalization | Error Rate (Sunny) | 2.6 | 15 |
| Depth Estimation | Proposed Synthetic Dataset 1.0 (Evaluation set) | MAE (m) | 0.0862 | 6 |
| Depth Estimation | Real-world indoor scenes | MAE (m) | 0.379 | 6 |
| Depth Estimation | 640x480 image pairs (test) | FPS | 3.63 | 5 |
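
For readers unfamiliar with the stereo metrics in the table, the sketch below shows how EPE and the KITTI 2015 D1 outlier rate are commonly computed from predicted and ground-truth disparities. It is an illustration rather than any benchmark's official evaluation code: it assumes flat lists of disparities, and that a ground-truth value of 0 or less marks an invalid pixel. The function names are hypothetical.

```python
def _valid_pairs(pred, gt):
    """Pair predictions with ground truth, skipping invalid pixels.

    Assumption: ground-truth disparity <= 0 means "no measurement".
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return pairs


def epe(pred, gt):
    """End-point error: mean absolute disparity error in pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) for p, g in pairs) / len(pairs)


def d1_error(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """D1 outlier rate in percent.

    A pixel counts as an outlier when its error exceeds both
    abs_thresh pixels and rel_thresh times the true disparity.
    """
    pairs = _valid_pairs(pred, gt)
    outliers = sum(
        1
        for p, g in pairs
        if abs(p - g) > abs_thresh and abs(p - g) > rel_thresh * g
    )
    return 100.0 * outliers / len(pairs)


if __name__ == "__main__":
    pred = [10.0, 20.0, 50.0, 7.0]
    gt = [10.5, 24.0, 52.0, 0.0]   # last pixel has no ground truth
    print(epe(pred, gt), d1_error(pred, gt))
```

The threshold-style rows (for example "Error Rate (3px, Noc)" or "Threshold Error > 1px") follow the same pattern with a single absolute threshold and a benchmark-specific validity mask.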