FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution
About
A versatile video depth estimation model should (1) be accurate and consistent across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming. We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044x1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data and training. We evaluate our approach across multiple unseen datasets against state-of-the-art depth models, and find that ours outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable applications that require high-resolution depth, such as video editing, as well as applications that require online decision-making, such as robotics. We release all code and model weights at https://github.com/Eyeline-Research/FlashDepth
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Depth Estimation | KITTI | -- | 156 |
| Monocular Depth Estimation | Sintel | Abs Rel 0.288 | 127 |
| Depth Estimation | Sintel ~50 frames | AbsRel 0.265 | 70 |
| Depth Estimation | KITTI 110 frames | AbsRel 10.3 | 69 |
| Monocular Depth Estimation | KITTI | AbsRel 8.4 | 69 |
| Video Depth Estimation | Bonn 110 frames | AbsRel 5.3 | 63 |
| Monocular Depth Estimation | BONN | Delta 1.25 Accuracy 96.7 | 60 |
| Depth Estimation | Sintel | AbsRel 0.36 | 29 |
| Video Depth Estimation | Scannet 90 frames | AbsRel 0.101 | 22 |
| Depth Estimation | TUM-RGBD | Abs Rel Error 0.08 | 16 |
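
The results above are reported as absolute relative error (AbsRel) and Delta 1.25 accuracy. As a minimal sketch of how these two metrics are commonly defined, the snippet below computes both over flat lists of predicted and ground-truth depths. The function names `abs_rel` and `delta_accuracy` are hypothetical, and this page does not state the evaluation protocol behind each row (scale or shift alignment, validity masks, depth caps), so the code is an illustration rather than the benchmarks' exact procedure. Values such as 10.3 or 96.7 are presumably percentages, whereas the functions here return fractions.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with positive ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels where max(pred/gt, gt/pred) < threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    # A non-positive prediction can never satisfy the ratio test, so it counts as a miss.
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(pairs)


if __name__ == "__main__":
    # Toy depths (arbitrary units), assumed to be already aligned to the ground truth.
    pred = [1.0, 2.2, 3.0, 8.0]
    gt = [1.0, 2.0, 4.0, 4.0]
    print("AbsRel:", abs_rel(pred, gt))
    print("Delta 1.25 accuracy:", delta_accuracy(pred, gt))
```

Lower AbsRel is better, while higher Delta 1.25 accuracy is better, which is worth keeping in mind when comparing rows that report different metrics.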