Endo-Depth-and-Motion: Reconstruction and Tracking in Endoscopic Videos using Depth Networks and Photometric Constraints
About
Estimating a scene reconstruction and the camera motion from in-body videos is challenging due to several factors, e.g. the deformation of in-body cavities or the lack of texture. In this paper we present Endo-Depth-and-Motion, a pipeline that estimates the 6-degrees-of-freedom camera pose and dense 3D scene models from monocular endoscopic videos. Our approach leverages recent advances in self-supervised depth networks to generate pseudo-RGBD frames, then tracks the camera pose using photometric residuals and fuses the registered depth maps in a volumetric representation. We present an extensive experimental evaluation on the public Hamlyn dataset, showing high-quality results and comparisons against relevant baselines. We also release all models and code for future comparisons.
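To make the tracking step concrete, here is a minimal NumPy sketch of a photometric residual between a reference pseudo-RGBD frame and the current image, evaluated for one candidate pose. It is an illustration under stated assumptions, not the released implementation: the function name `photometric_residuals`, its argument layout, the pinhole model and the nearest-neighbour sampling are all hypothetical choices. A tracker would minimise these residuals over the pose, which this sketch does not attempt.

```python
import numpy as np


def photometric_residuals(ref_gray, ref_depth, cur_gray, K, T_cur_ref):
    """Intensity residuals for one candidate pose (hypothetical helper).

    ref_gray, cur_gray: (H, W) float images; ref_depth: (H, W) predicted depth;
    K: 3x3 pinhole intrinsics (assumed); T_cur_ref: 4x4 rigid transform that
    maps points from the reference camera frame to the current camera frame.
    Returns (residuals, valid), both of shape (H, W).
    """
    h, w = ref_gray.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)]).astype(float)

    # Back-project every reference pixel to 3D using its depth.
    pts_ref = (np.linalg.inv(K) @ pix) * ref_depth.ravel()
    # Move the points into the current camera frame.
    pts_cur = T_cur_ref[:3, :3] @ pts_ref + T_cur_ref[:3, 3:4]

    z = pts_cur[2]
    in_front = z > 1e-6
    z_safe = np.where(in_front, z, 1.0)  # avoid dividing by zero or negative depth

    # Project into the current image and round to the nearest pixel.
    proj = K @ pts_cur
    u_cur = np.rint(proj[0] / z_safe).astype(int)
    v_cur = np.rint(proj[1] / z_safe).astype(int)
    valid = in_front & (u_cur >= 0) & (u_cur < w) & (v_cur >= 0) & (v_cur < h)

    # Residual = sampled current intensity minus reference intensity.
    residuals = np.full(h * w, np.nan)
    residuals[valid] = cur_gray[v_cur[valid], u_cur[valid]] - ref_gray.ravel()[valid]
    return residuals.reshape(h, w), valid.reshape(h, w)
```

With an identity pose and identical images every residual is zero. A practical tracker would typically replace nearest-neighbour sampling with bilinear interpolation and add a robust cost, both of which are left out here for brevity.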
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Depth Estimation | SCARED (test) | Abs Rel | 0.203 | 21 |
| Trajectory Estimation | Drunkard's Dataset Level 0 | Success Rate [%] | 100 | 11 |
| Trajectory Estimation | Drunkard's Dataset Level 1 | Frame Accuracy | 1 | 11 |
| Trajectory Estimation | Drunkard's Dataset Level 2 | Frame Success Rate | 1 | 11 |
| Trajectory Estimation | Drunkard's Dataset Level 3 | Frame Accuracy (%) | 1 | 11 |
| Depth Estimation | Hamlyn 22 videos | Abs Rel | 0.216 | 10 |
| Novel View Synthesis | C3VD average across ten scenes | PSNR | 18.13 | 10 |
| Rendering | C3VD high-definition (test) | PSNR | 18.13 | 8 |
| Camera Tracking | C3VD high-definition (test) | ATE (mm) | 1.25 | 8 |
| Depth Reconstruction | C3VD high-definition (test) | RMSE (mm) | 5.1 | 8 |
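For readers unfamiliar with the metrics in the table, the sketch below shows how Abs Rel, RMSE and PSNR are commonly computed. It assumes the standard textbook definitions; the exact evaluation protocol of each benchmark (depth masking, scale alignment, depth caps, image range) is not stated here and may differ. The function names and the `max_val` parameter are illustrative.

```python
import numpy as np


def abs_rel(pred, gt):
    """Mean absolute relative depth error over pixels with valid ground truth."""
    mask = gt > 0  # assumption: non-positive ground truth marks missing depth
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))


def rmse(pred, gt):
    """Root mean squared depth error, in the units of the inputs (e.g. mm)."""
    mask = gt > 0
    return float(np.sqrt(np.mean((pred[mask] - gt[mask]) ** 2)))


def psnr(rendered, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((rendered - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Lower is better for Abs Rel and RMSE, higher is better for PSNR. ATE is omitted because it additionally requires aligning the estimated trajectory to the ground-truth one before measuring position error.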