ViPE: Video Pose Engine for 3D Geometric Perception

About

Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, and dashcam footage, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We evaluate ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3-5 FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames, all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.
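As a rough sanity check on the scale these figures imply, the sketch below derives the average number of frames per video and the single-GPU time needed to process about 96M frames at the quoted 3-5 FPS. It assumes the quoted throughput applies uniformly to the whole collection, which the abstract does not state; the function name `annotation_budget` is illustrative and not part of ViPE.

```python
def annotation_budget(total_frames, num_videos, fps_low, fps_high):
    """Return (average frames per video, min GPU-hours, max GPU-hours).

    Assumption: every frame is processed once on a single GPU at a
    throughput between fps_low and fps_high frames per second.
    """
    avg_frames = total_frames / num_videos
    hours_fast = total_frames / fps_high / 3600  # best case: highest FPS
    hours_slow = total_frames / fps_low / 3600   # worst case: lowest FPS
    return avg_frames, hours_fast, hours_slow


# Video counts and frame total as quoted in the abstract (all approximate).
num_videos = 100_000 + 1_000_000 + 2_000
avg_frames, hours_fast, hours_slow = annotation_budget(96_000_000, num_videos, 3, 5)
print(f"{avg_frames:.0f} frames/video, {hours_fast:.0f}-{hours_slow:.0f} GPU-hours")
```

Under these assumptions the collection averages roughly 87 frames per video, and annotating it takes on the order of 5,300 to 8,900 single-GPU hours.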

Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, Sanja Fidler • 2025

Related benchmarks

Task	Dataset	Result
Camera Tracking	BONN dynamic sequences	Balloon Error: 3.3	38
Binary Question Answering	ACaM Synthetic Videos Binary QA (test)	Accuracy (Static): 66.7	23
Video Multiple Choice Question Answering	ACaM real-world videos 1.0 (test)	Accuracy (Static): 61.81	23
Camera movement understanding	ACaM synthetic videos (test)	Static Accuracy: 68.16	23
Camera pose estimation	Oxford Spires sparse setting	AUC@15: 45.35	18
SLAM	TUM-RGBD	XYZ Error (fr3, w): 2.4	9
Tracking	TUM RGB-D (dynamic sequences)	ATE RMSE (ws) [cm]: 0.5	8
Tracking	Wild-SLAM MoCap Dataset	ATE RMSE (ANYmal1): 0.4	8
Tracking	7-scenes static	ATE RMSE: 0.05	8
Tracking	TUM RGB-D static	ATE RMSE: 0.065	8
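Several rows above report ATE RMSE, the root-mean-square of the per-frame position error between an estimated and a ground-truth camera trajectory. The sketch below shows only that final step. It assumes the two trajectories are already time-associated and aligned to a common frame (benchmark protocols usually perform a rigid or similarity alignment first, which is omitted here), and the function name `ate_rmse` is illustrative rather than taken from ViPE or any benchmark toolkit.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of per-frame position error between two aligned trajectories.

    Both inputs are equal-length sequences of (x, y, z) camera positions,
    assumed to be time-associated and expressed in the same frame.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Squared Euclidean distance between corresponding positions.
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy trajectories: the estimate is offset by 1 unit along x at every frame.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est = [(x + 1.0, y, z) for x, y, z in gt]
print(ate_rmse(est, gt))
```

The result is in the same unit as the input positions, so a value reported in centimetres requires positions in centimetres.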
