Unifying Flow, Stereo and Depth Estimation
About
We present a unified formulation and model for three motion and 3D perception tasks: optical flow, rectified stereo matching and unrectified stereo depth estimation from posed images. Unlike previous specialized architectures for each specific task, we formulate all three tasks as a unified dense correspondence matching problem, which can be solved with a single model by directly comparing feature similarities. Such a formulation calls for discriminative feature representations, which we achieve using a Transformer, in particular the cross-attention mechanism. We demonstrate that cross-attention enables integration of knowledge from another image via cross-view interactions, which greatly improves the quality of the extracted features. Our unified model naturally enables cross-task transfer since the model architecture and parameters are shared across tasks. We outperform RAFT with our unified model on the challenging Sintel dataset, and our final model that uses a few additional task-specific refinement steps outperforms or compares favorably to recent state-of-the-art methods on 10 popular flow, stereo and depth datasets, while being simpler and more efficient in terms of model design and inference speed.
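To make "dense correspondence matching by directly comparing feature similarities" concrete, here is a minimal NumPy sketch for the optical flow case. It is an illustration under stated assumptions, not the authors' implementation: the function name `global_match_flow` is hypothetical, the features are taken as given (the Transformer and cross-attention feature extractor is omitted), and the softmax-weighted average over all candidate positions is one plausible way to convert similarities into correspondences.

```python
import numpy as np

def global_match_flow(feat1, feat2):
    """Hypothetical helper: dense matching between two (H, W, C) feature maps.

    Returns a flow field of shape (H, W, 2), stored as (dx, dy) per pixel.
    """
    h, w, c = feat1.shape
    f1 = feat1.reshape(h * w, c)
    f2 = feat2.reshape(h * w, c)

    # All-pairs feature similarity, scaled as in dot-product attention.
    corr = f1 @ f2.T / np.sqrt(c)

    # Softmax over every candidate position in the second image
    # (assumption: this normalization is chosen for illustration).
    corr -= corr.max(axis=1, keepdims=True)  # numerical stability
    prob = np.exp(corr)
    prob /= prob.sum(axis=1, keepdims=True)

    # Pixel coordinates (x, y), flattened in row-major order.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([xs, ys], axis=-1).reshape(h * w, 2).astype(np.float64)

    # Expected match position is the probability-weighted mean coordinate;
    # flow is the displacement from each source pixel to that position.
    matched = prob @ grid
    return (matched - grid).reshape(h, w, 2)

# Toy check: the second "image" is the first one shifted 1 px down and 2 px right.
# Features are scaled up so the softmax is close to one-hot.
rng = np.random.default_rng(0)
feat1 = 10.0 * rng.standard_normal((8, 8, 64))
feat2 = np.roll(feat1, shift=(1, 2), axis=(0, 1))
flow = global_match_flow(feat1, feat2)
```

For an interior pixel such as `flow[3, 3]`, the recovered displacement should be close to `(2, 1)`; pixels near the border wrap around because of `np.roll`. Under the same assumptions, rectified stereo would restrict the candidates to the same image row, so the 2D search becomes a 1D search along the horizontal axis.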
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Stereo Matching | KITTI 2015 (test) | D1 Error (Overall) | 1.77 | 144 |
| Stereo Matching | KITTI 2015 | D1 Error (All) | 5.72 | 118 |
| Optical Flow Estimation | Sintel Final (test) | -- | -- | 101 |
| Optical Flow | KITTI 2015 (test) | Fl Error (All) | 4.49 | 95 |
| Stereo Matching | KITTI 2012 | Error Rate (3px, Noc) | 4.87 | 81 |
| Depth Estimation | ScanNet (test) | Abs Rel | 0.059 | 65 |
| Stereo Matching | ETH3D | bad 1.0 | 2.07 | 51 |
| Stereo Matching | Middlebury (test) | -- | -- | 47 |
| Optical Flow | Sintel clean (test) | AEE (Unmatched) | 6.68 | 37 |
| Stereo Matching | Middlebury | Bad Pixel Rate (Thresh 2.0) | 11.7 | 34 |