XVO: Generalized Visual Odometry via Cross-Modal Self-Training
About
We propose XVO, a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models with robust off-the-shelf operation across diverse datasets and settings. In contrast to standard monocular VO approaches, which often assume a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale from visual scene semantics, i.e., without relying on any known camera parameters. We optimize the motion estimation model via self-training from large amounts of unconstrained and heterogeneous dash camera videos available on YouTube. Our key contribution is twofold. First, we empirically demonstrate the benefits of semi-supervised training for learning a general-purpose direct VO regression network. Second, we demonstrate multi-modal supervision, including segmentation, flow, depth, and audio auxiliary prediction tasks, to facilitate generalized representations for the VO task. Specifically, we find that the audio prediction task significantly enhances the semi-supervised learning process while alleviating noisy pseudo-labels, particularly in highly dynamic and out-of-domain video data. Our proposed teacher network achieves state-of-the-art performance on the commonly used KITTI benchmark despite using no multi-frame optimization or knowledge of camera parameters. Combined with the proposed semi-supervised step, XVO demonstrates off-the-shelf knowledge transfer across diverse conditions on KITTI, nuScenes, and Argoverse without fine-tuning.
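The generic self-training recipe that the abstract refers to (train a teacher on labeled data, pseudo-label unlabeled data, then train a student on both) can be sketched as follows. This is a toy illustration only, not the XVO implementation: the linear least-squares model stands in for a VO regression network, and the names `fit_linear`, `self_train`, the 8-dimensional features, and the 6-DoF targets are all hypothetical choices made for the example. The auxiliary segmentation, flow, depth, and audio tasks are not modeled.

```python
import numpy as np


def fit_linear(features, targets):
    """Least-squares fit of a linear map (toy stand-in for a VO regressor)."""
    weights, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return weights


def self_train(labeled_x, labeled_y, unlabeled_x):
    """Teacher -> pseudo-labels -> student (generic self-training loop)."""
    # 1. Train the teacher on the labeled set only.
    teacher = fit_linear(labeled_x, labeled_y)
    # 2. Let the teacher pseudo-label the unlabeled set.
    pseudo_y = unlabeled_x @ teacher
    # 3. Train the student on labeled plus pseudo-labeled samples.
    student = fit_linear(
        np.vstack([labeled_x, unlabeled_x]),
        np.vstack([labeled_y, pseudo_y]),
    )
    return teacher, student


# Synthetic data (assumption): 8-d features mapped to 6-DoF relative poses.
rng = np.random.default_rng(0)
true_map = rng.normal(size=(8, 6))
labeled_x = rng.normal(size=(50, 8))
labeled_y = labeled_x @ true_map + 0.01 * rng.normal(size=(50, 6))
unlabeled_x = rng.normal(size=(500, 8))

teacher, student = self_train(labeled_x, labeled_y, unlabeled_x)
print(teacher.shape, student.shape)
```

In this purely linear toy the student reproduces the teacher, because the pseudo-labels are already perfectly explained by the teacher's weights. The sketch only shows the data flow of the teacher and student stages, not where any improvement would come from.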
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Planning | NAVSIM (test) | PDMS | 78.4 | 22 |
| Visual Localization | 360SPR Pinhole (unseen) | TE (m) | 4.55 | 14 |
| Visual Localization | 360Loc cross-validation (unseen) | Median Translation Error (m) | 2.56 | 13 |
| Scene Pose Regression | 360SPR 1.0 (unseen) | Median Translation Error (m) | 4.25 | 13 |
| Visual Localization | 360Loc official (seen) | Median Translation Error (m) | 2.43 | 13 |
| Scene Pose Regression | 360SPR 1.0 (seen) | Median Translation Error (m) | 4.11 | 13 |
| Visual Odometry | Argoverse 10Hz 2 (unseen camera setups) | Translational Error (t_err) | 9.13 | 8 |
| Visual Odometry | nuScenes 12Hz (unseen regions) | Translation Error (m) | 12.75 | 8 |
| Visual Odometry | KITTI 10Hz (00-10) | Translational Error | 16.82 | 8 |
| Visual Localization | 7Scenes Pinhole (unseen environments) | Translation Error (m) | 0.7 | 7 |