TartanVO: A Generalizable Learning-based VO
About
We present the first learning-based visual odometry (VO) model that generalizes to multiple datasets and real-world scenarios and outperforms geometry-based methods in challenging scenes. We achieve this by leveraging the SLAM dataset TartanAir, which provides a large amount of diverse synthetic data in challenging environments. Furthermore, to make our VO model generalize across datasets, we propose an up-to-scale loss function and incorporate the camera intrinsic parameters into the model. Experiments show that a single model, TartanVO, trained only on synthetic data and without any finetuning, generalizes to real-world datasets such as KITTI and EuRoC, demonstrating significant advantages over geometry-based methods on challenging trajectories. Our code is available at https://github.com/castacks/tartanvo.
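The two generalization ideas mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the TartanVO repository: the function names `up_to_scale_translation_loss` and `intrinsics_layer` are hypothetical, and the formulations are assumptions about one plausible reading. The loss compares translation directions after normalizing both vectors, so monocular scale ambiguity is not penalized, and the intrinsics are encoded as a two-channel image of normalized pixel coordinates that could be concatenated to the network input.

```python
import numpy as np


def up_to_scale_translation_loss(t_pred, t_true, eps=1e-6):
    """Squared distance between unit-normalized translations.

    Assumed form: both vectors are divided by max(norm, eps), so only
    the direction of motion contributes to the loss, not its magnitude.
    """
    t_pred = np.asarray(t_pred, dtype=float)
    t_true = np.asarray(t_true, dtype=float)
    n_pred = t_pred / max(np.linalg.norm(t_pred), eps)
    n_true = t_true / max(np.linalg.norm(t_true), eps)
    return float(np.sum((n_pred - n_true) ** 2))


def intrinsics_layer(height, width, fx, fy, cx, cy):
    """Two-channel map of normalized image coordinates.

    Channel 0 holds (x - cx) / fx and channel 1 holds (y - cy) / fy
    for every pixel, giving an array of shape (2, height, width).
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(float)
    return np.stack([(xs - cx) / fx, (ys - cy) / fy], axis=0)


if __name__ == "__main__":
    # Same direction, different scale: the loss is (numerically) zero.
    print(up_to_scale_translation_loss([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))
    # Hypothetical intrinsics for a 480x640 image.
    layer = intrinsics_layer(480, 640, fx=320.0, fy=320.0, cx=320.0, cy=240.0)
    print(layer.shape)
```

Because the layer depends only on the image size and the camera parameters, a model given this extra input can, in principle, be evaluated on cameras with different focal lengths without retraining. Whether this matches the released code exactly should be checked against the repository.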
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual-Inertial Odometry | EuRoC (All sequences) | MH1 Error | 0.639 | 51 |
| Camera pose estimation | TUM freiburg1 | Rotation Error | 0.049 | 34 |
| Visual Odometry | TUM-RGBD | freiburg1/xyz Error | 0.062 | 34 |
| Camera pose estimation | Sintel 14-sequence | ATE | 23.8 | 15 |
| Tracking | EuRoC Dataset | MH 01 Score | 63.9 | 13 |
| Monocular SLAM | EuRoC (test) | ATE Error (MH03) | 0.55 | 12 |
| Camera pose estimation | MPI Sintel | ATE (m) | 0.238 | 11 |
| Visual Odometry | TartanAir (test) | Error MH000 | 4.88 | 11 |
| Simultaneous Localization and Mapping (SLAM) | TUM-RGBD (various sequences) | Error Desk | 0.125 | 8 |
| Visual Odometry | nuScenes 12Hz (unseen regions) | Translation Error (m) | 10.27 | 8 |