VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

About

We present VGGT-SLAM, a dense RGB SLAM system constructed by incrementally and globally aligning submaps created from the feed-forward scene reconstruction approach VGGT using only uncalibrated monocular cameras. While related works align submaps using similarity transforms (i.e., translation, rotation, and scale), we show that such approaches are inadequate in the case of uncalibrated cameras. In particular, we revisit the idea of reconstruction ambiguity, where given a set of uncalibrated cameras with no assumption on the camera motion or scene structure, the scene can only be reconstructed up to a 15-degrees-of-freedom projective transformation of the true geometry. This inspires us to recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degrees-of-freedom homography transforms between sequential submaps while accounting for potential loop closure constraints. As verified by extensive experiments, we demonstrate that VGGT-SLAM achieves improved map quality using long video sequences that are infeasible for VGGT due to its high GPU requirements.

Dominic Maggio, Hyungtae Lim, Luca Carlone • 2025
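
The abstract's count of 15 degrees of freedom follows from a 4x4 homography having 16 entries defined only up to a common scale. Fixing the determinant to 1 removes that scale and places the matrix in SL(4). A similarity transform, by contrast, has only 7 degrees of freedom (3 rotation, 3 translation, 1 scale). The sketch below illustrates the normalization in plain Python. It is not the authors' implementation, and the helper names (`det`, `normalize_to_sl4`, `apply_homography`, `similarity`) are hypothetical, chosen for illustration only.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (fine for 4x4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total


def normalize_to_sl4(h):
    """Rescale a 4x4 homography so that its determinant is 1.

    det(s * H) = s**4 * det(H), so a real rescaling can only reach
    determinant 1 when det(H) is positive.
    """
    d = det(h)
    if d <= 0:
        raise ValueError("determinant must be positive to normalize into SL(4)")
    s = d ** 0.25
    return [[v / s for v in row] for row in h]


def apply_homography(h, point):
    """Map a 3D point through a 4x4 homography using homogeneous coordinates."""
    vec = (point[0], point[1], point[2], 1.0)
    out = [sum(h[i][k] * vec[k] for k in range(4)) for i in range(4)]
    return tuple(c / out[3] for c in out[:3])


def similarity(scale, t):
    """Illustrative similarity transform (identity rotation) as a 4x4 matrix."""
    return [
        [scale, 0.0, 0.0, t[0]],
        [0.0, scale, 0.0, t[1]],
        [0.0, 0.0, scale, t[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


# A similarity transform is a special case of a homography.
H = similarity(2.0, (1.0, 0.0, -1.0))
H_sl4 = normalize_to_sl4(H)

print(det(H), det(H_sl4))
print(apply_homography(H, (1.0, 2.0, 3.0)))
print(apply_homography(H_sl4, (1.0, 2.0, 3.0)))
```

Both matrices map points identically, because the homogeneous division cancels the overall scale. The determinant constraint only selects one representative per homography.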

Related benchmarks

Task	Dataset	Result
Camera pose estimation	Sintel	ATE 0.303	203
3D Reconstruction	7 Scenes	Completion 6.2	161
Camera pose estimation	ScanNet	RPE (t) 0.049	133
Visual-Inertial Odometry	EuRoC (All sequences)	MH1 Error 0.4	69
Tracking	KITTI	ATE RMSE (m) 10.985	66
Camera pose estimation	TUM	ATE 0.03	65
Video Depth Estimation	Sintel (test)	Delta 1 Accuracy 56	61
3D Reconstruction	ETH3D	F1 Score 72	50
Visual Odometry	KITTI	KITTI Seq 03 Error 167.8	45
Visual Odometry	TUM-RGBD	freiburg1/desk2 Error 0.568	43

Showing 10 of 96 rows

