Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail
About
We introduce Stereo Anywhere, a novel stereo-matching framework that combines geometric constraints with robust priors from monocular depth Vision Foundation Models (VFMs). By coupling these two complementary worlds through a dual-branch architecture, we seamlessly integrate stereo matching with learned contextual cues. Building on this design, our framework introduces novel cost-volume fusion mechanisms that effectively handle critical challenges such as textureless regions, occlusions, and non-Lambertian surfaces. Through our novel optical-illusion dataset, MonoTrap, and extensive evaluation across multiple benchmarks, we demonstrate that our synthetic-only trained model achieves state-of-the-art zero-shot generalization, significantly outperforming existing solutions while showing remarkable robustness to challenging cases such as mirrors and transparencies.
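To make the core idea concrete, below is a minimal, schematic sketch of fusing a geometric stereo cost volume with a monocular depth prior. It is not the authors' implementation: the SAD matching cost, the Gaussian prior around the mono disparity, and the multiplicative fusion are illustrative assumptions standing in for the learned dual-branch fusion described above.

```python
import numpy as np

def stereo_cost_volume(left, right, max_disp):
    """Per-pixel absolute-difference matching cost for each disparity hypothesis."""
    H, W = left.shape
    cost = np.full((max_disp, H, W), np.inf)  # inf where the shift leaves the image
    for d in range(max_disp):
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, :W - d])
    return cost

def mono_prior_volume(mono_disp, max_disp, sigma=1.0):
    """Soft likelihood over disparities, centred on a monocular disparity estimate
    (standing in for the VFM branch)."""
    d = np.arange(max_disp, dtype=float)[:, None, None]
    return np.exp(-0.5 * ((d - mono_disp[None]) / sigma) ** 2)

def fuse(cost, prior, eps=1e-8):
    """Turn matching costs into a distribution over disparities, reweight it with
    the mono prior, and take the winner-take-all disparity."""
    prob = np.exp(-cost)                                  # soft-min over disparities
    prob /= prob.sum(axis=0, keepdims=True) + eps
    return (prob * prior).argmax(axis=0)
```

On a textureless patch the stereo cost is flat across disparities, so the mono prior dominates the product; where the stereo cost has a sharp minimum, it dominates instead — the same complementarity the dual-branch design exploits.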
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Stereo Matching | KITTI 2015 | D1 Error (All) | 3.93 | 118 |
| Stereo Matching | KITTI 2012 | Error Rate (3px, Noc) | 3.52 | 81 |
| Stereo Matching | ETH3D | Threshold Error > 1px (All) | 1.66 | 30 |
| Stereo Matching | Booster Q (test) | Error Rate (> 2%) | 6.52 | 26 |
| Stereo Depth Estimation | SQUID zero-shot | Relative Error (Rel) | 0.0952 | 16 |
| Stereo Matching | LayeredFlow E (test) | Error Rate (> 1%) | 51.24 | 13 |
| Stereo Depth Estimation | TartanAir underwater (test) | Relative Error (Rel) | 0.0592 | 13 |
| Stereo Matching | Middlebury half-resolution 2014 v3 (test) | Bad Error Rate (All) | 6.96 | 11 |
| Stereo Matching | Middlebury 2021 | Bad Pixel Rate (Thresh > 2.0, All) | 7.97 | 11 |
| DSM Reconstruction | Omaha Synchronic DFC2019 | Altitude MAE (m) | 1.04 | 8 |