Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers
About
Stereo depth estimation relies on optimal correspondence matching between pixels on epipolar lines in the left and right images to infer depth. In this work, we revisit the problem from a sequence-to-sequence correspondence perspective to replace cost volume construction with dense pixel matching using position information and attention. This approach, named STereo TRansformer (STTR), has several advantages: it (1) relaxes the limitation of a fixed disparity range, (2) identifies occluded regions and provides confidence estimates, and (3) imposes uniqueness constraints during the matching process. We report promising results on both synthetic and real-world datasets and demonstrate that STTR generalizes across different domains, even without fine-tuning.
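To illustrate the idea of matching pixels along an epipolar line with attention rather than a fixed-range cost volume, here is a minimal NumPy sketch. It is not the STTR implementation: the function name `match_row_with_attention`, the `temperature` parameter, and the use of plain L2-normalized dot-product attention on a single rectified scanline are assumptions made for illustration. It omits positional encoding, occlusion handling, and the uniqueness constraint mentioned in the abstract.

```python
import numpy as np

def match_row_with_attention(left_feats, right_feats, temperature=0.1):
    """Toy attention-based matching for one rectified scanline (illustrative only).

    left_feats, right_feats: (W, C) arrays of per-pixel descriptors.
    Returns (disparity, confidence), each of shape (W,).
    """
    # L2-normalize so that identical descriptors get the highest similarity.
    l = left_feats / np.linalg.norm(left_feats, axis=1, keepdims=True)
    r = right_feats / np.linalg.norm(right_feats, axis=1, keepdims=True)

    # Similarity of every left pixel to every right pixel on the same row.
    # No disparity range is fixed in advance: all W positions are candidates.
    scores = (l @ r.T) / temperature

    # Softmax over right-image positions (numerically stabilized).
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)

    # The most-attended right pixel is the match; its weight is a crude confidence.
    best = attn.argmax(axis=1)
    confidence = attn.max(axis=1)

    # Disparity is x_left - x_right for a rectified pair.
    disparity = np.arange(left_feats.shape[0]) - best
    return disparity, confidence

# Usage: synthetic scanline where the left row is the right row shifted by 3 pixels.
rng = np.random.default_rng(0)
right_row = rng.standard_normal((32, 16))
left_row = np.roll(right_row, 3, axis=0)
disparity, confidence = match_row_with_attention(left_row, right_row)
```

In this toy setting the first three left pixels wrap around and have no valid match, so only `disparity[3:]` is meaningful. A low attention weight is one simple signal that a pixel may be unmatched, though this sketch makes no attempt to detect occlusion the way the abstract describes.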
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Stereo Matching | KITTI 2015 (test) | -- | 144 |
| Stereo Matching | KITTI 2012 (test) | -- | 76 |
| Stereo Matching | ETH3D (test) | -- | 30 |
| Stereo Matching | KITTI 15 | D1 Error (%): 8.31 | 27 |
| Stereo Matching | ETH3D (train) | Bad 1.0 Rate: 17.2 | 23 |
| Stereo Matching | Middlebury quarter resolution (test) | Threshold Error Rate: 9.7 | 19 |
| Stereo Matching | Middlebury half resolution (test) | Threshold Error Rate: 15.5 | 19 |
| Depth Estimation | Gated Stereo Day 1.0 (test) | RMSE: 16.77 | 19 |
| Depth Estimation | Gated Stereo Night 1.0 (test) | RMSE: 20.99 | 19 |
| Stereo Matching | SCARED Set 2 Original 2019 (test) | KF1 Score: 7.42 | 12 |