VGGT-Motion: Motion-Aware Calibration-Free Monocular SLAM for Long-Range Consistency
About
Despite recent progress in calibration-free monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.
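The motion-aware partitioning idea can be illustrated with a small sketch. The abstract does not specify the actual criteria, so everything here is an assumption made for illustration: the function name `partition_by_motion`, the use of a per-frame mean optical-flow magnitude as the motion signal, the thresholds `static_thresh`, `motion_budget`, and `max_frames`, and the choice to reuse the last frame of one submap as the first frame of the next. Turn encapsulation is not modeled.

```python
from typing import List, Sequence


def partition_by_motion(
    flow_mag: Sequence[float],
    static_thresh: float = 0.5,   # hypothetical: below this (pixels) a frame counts as static
    motion_budget: float = 50.0,  # hypothetical: accumulated flow that closes a submap
    max_frames: int = 32,         # hypothetical: hard cap on frames per submap
) -> List[List[int]]:
    """Group frame indices into submaps using accumulated optical-flow magnitude.

    flow_mag[i] is the mean flow magnitude between frame i-1 and frame i.
    flow_mag[0] is ignored because frame 0 is always kept.
    """
    if len(flow_mag) == 0:
        return []

    submaps: List[List[int]] = []
    current = [0]
    accumulated = 0.0

    for i in range(1, len(flow_mag)):
        if flow_mag[i] < static_thresh:
            # Near-zero motion: skip the frame so static stretches add no redundancy.
            continue
        current.append(i)
        accumulated += flow_mag[i]
        if accumulated >= motion_budget or len(current) >= max_frames:
            submaps.append(current)
            # Assumption: the closing frame is shared with the next submap.
            current = [i]
            accumulated = 0.0

    # Keep a trailing submap only if it holds more than the shared frame,
    # or if no submap was emitted at all.
    if len(current) > 1 or not submaps:
        submaps.append(current)
    return submaps
```

For example, `partition_by_motion([0.0, 0.1, 10.0, 0.2, 15.0, 30.0, 0.0, 40.0, 20.0])` returns `[[0, 2, 4, 5], [5, 7, 8]]`: frames 1, 3, and 6 are dropped as static, and frame 5 appears in both submaps as the shared frame. Cutting on accumulated motion rather than on a fixed frame count is what makes submap length adapt to how fast the camera moves.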
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Monocular SLAM | KITTI (Sequences 00-10) | ATE RMSE Seq 037.08 | 9 |
| Monocular SLAM | Waymo Open (test) | Metric 1634531911.35 | 6 |
| Monocular SLAM | 4Seasons long-sequence generalization | ATE (m): 12.22 | 3 |
| Monocular SLAM | Complex Urban long-sequence generalization | ATE (m): 35.48 | 3 |
| Monocular SLAM | A2D2 long-sequence generalization | ATE (m): 29.8 | 3 |
| Monocular SLAM | TUM-Mono Handheld Sequences | Seq 17 Error: 10.31 | 3 |