ViPE: Video Pose Engine for 3D Geometric Perception
About
Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, and dashcam footage, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We evaluate ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18% on TUM sequences and 50% on KITTI sequences, and runs at 3-5 FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames, all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Camera Tracking | BONN dynamic sequences | Balloon Error | 3.3 | 38 |
| Camera pose estimation | Oxford Spires sparse setting | AUC@15 | 45.35 | 18 |
| SLAM | TUM-RGBD | XYZ Error (fr3, w) | 2.4 | 9 |
| Tracking | TUM RGB-D (dynamic sequences) | ATE RMSE (ws) [cm] | 0.5 | 8 |
| Tracking | Wild-SLAM MoCap Dataset | ATE RMSE (ANYmal1) | 0.4 | 8 |
| Tracking | 7-scenes static | ATE RMSE | 0.05 | 8 |
| Tracking | TUM RGB-D static | ATE RMSE | 0.065 | 8 |
| Tracking | Sintel low-motion | ATE RMSE | 0.028 | 7 |
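
Several rows in the table report ATE RMSE (absolute trajectory error, root mean square). As a rough illustration of what that metric measures, the sketch below aligns an estimated camera trajectory to ground truth with a least-squares similarity transform (the Umeyama method) and then takes the RMSE of the remaining position errors. This is not ViPE's evaluation code or the protocol of any benchmark listed here: the function names `align_similarity` and `ate_rmse` are hypothetical, the inputs are assumed to be time-synchronized `(N, 3)` arrays of camera positions, and individual benchmarks may use rigid rather than similarity alignment.

```python
import numpy as np


def align_similarity(est, gt):
    """Least-squares similarity transform (s, R, t) mapping est onto gt.

    Hypothetical helper following the Umeyama closed-form solution.
    est, gt: (N, 3) arrays of corresponding camera positions.
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / len(est)              # 3x3 cross-covariance (target x source)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                    # avoid returning a reflection
    R = U @ S @ Vt
    var_e = (e ** 2).sum() / len(est)     # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_e  # optimal scale
    t = mu_g - s * R @ mu_e
    return s, R, t


def ate_rmse(est, gt):
    """ATE RMSE after similarity alignment, in the units of gt."""
    s, R, t = align_similarity(est, gt)
    aligned = s * est @ R.T + t           # apply the transform to every position
    err = np.linalg.norm(aligned - gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))


if __name__ == "__main__":
    # Synthetic example: a scaled, rotated, shifted, slightly noisy copy of gt.
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(100, 3))
    c, s_ = np.cos(0.3), np.sin(0.3)
    Rz = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
    est = 2.0 * gt @ Rz.T + np.array([1.0, -2.0, 0.5])
    est += rng.normal(scale=0.01, size=est.shape)
    print("ATE RMSE:", ate_rmse(est, gt))
```

Similarity alignment is used in this sketch because monocular estimates are in general only defined up to scale; when the estimate is already metric, a rigid alignment (fixing `s = 1`) is the more common choice. The result is in whatever unit the ground-truth positions use, so a value is only comparable to a table entry if the units match (for example, the `[cm]` noted for the TUM RGB-D dynamic sequences).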