ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association

About

We present ViSTA-SLAM, a real-time monocular visual SLAM system that operates without requiring camera intrinsics, making it broadly applicable across diverse camera setups. At its core, the system employs a lightweight symmetric two-view association (STA) model as the frontend, which simultaneously estimates relative camera poses and regresses local pointmaps from only two RGB images. This design significantly reduces model complexity (our frontend is only 35% the size of comparable state-of-the-art methods) while improving the quality of the two-view constraints used in the pipeline. In the backend, we construct a specially designed Sim(3) pose graph that incorporates loop closures to address accumulated drift. Extensive experiments demonstrate that our approach achieves superior performance in both camera tracking and dense 3D reconstruction quality compared to current methods. GitHub repository: https://github.com/zhangganlin/vista-slam
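
To make the role of Sim(3) in the backend concrete, the following is a minimal sketch of how similarity transforms (scale, rotation, translation) compose and how a pose-graph edge can be checked against its two endpoint poses. It is not taken from the ViSTA-SLAM code: the `(s, R, t)` tuple layout and the names `compose`, `inverse`, `apply`, and `edge_error` are assumptions chosen for illustration, and the actual graph design in the paper may differ.

```python
import math

# Illustrative assumption: a Sim(3) element is a tuple (s, R, t) with
# scale s, 3x3 rotation R (nested lists), and translation t.
# It maps a point x to s * R @ x + t.

def mat_vec(R, v):
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]

def rot_z(angle):
    """Rotation about the z axis, used here only to build test poses."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def compose(a, b):
    """Return the transform that applies b first, then a."""
    sa, Ra, ta = a
    sb, Rb, tb = b
    t = [sa * x + y for x, y in zip(mat_vec(Ra, tb), ta)]
    return (sa * sb, mat_mul(Ra, Rb), t)

def inverse(a):
    """Invert y = s * R @ x + t to x = (1/s) * R^T @ (y - t)."""
    s, R, t = a
    Rt = transpose(R)
    return (1.0 / s, Rt, [-x / s for x in mat_vec(Rt, t)])

def apply(a, x):
    s, R, t = a
    return [s * p + q for p, q in zip(mat_vec(R, x), t)]

def edge_error(T_i, T_j, T_ij):
    """Residual of an edge whose measurement T_ij should equal
    inverse(T_i) composed with T_j; identity means no error."""
    return compose(inverse(T_ij), compose(inverse(T_i), T_j))
```

When the two poses agree with the measured edge, `edge_error` returns the identity transform (scale 1, identity rotation, zero translation). If one pose drifts in scale, the residual scale departs from 1, which is the kind of discrepancy a loop-closure edge exposes and a Sim(3) optimization can correct; a rigid SE(3) graph has no scale term to absorb it.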

Ganlin Zhang, Shenhan Qian, Xi Wang, Daniel Cremers • 2025

Related benchmarks

Task	Dataset	Metric	Result
Absolute Trajectory Estimation	TUM RGB-D	Desk Error	0.03	36
Pose Estimation	7 Scenes	Average Median Translation Error (m)	5.5	29
3D Geometry Reconstruction	BundleFusion	Chamfer Distance	0.06	8
3D Geometry Reconstruction	7Scenes	Chamfer Distance	0.1	8
Relative Pose Estimation	7Scenes	AUC@5°	13	8
Relative Pose Estimation	BundleFusion	AUC@5°	15	8
Pose Estimation	Aria office environment	ATE (Floor 1 Seq 1)	0.84	6
Pose Estimation	Habitat-Matterport 3D (selected scenes)	Pose Error Metric A	0.351	5
Dense Reconstruction	Aria Office Dataset (Floor 1)	Reconstruction Score (Room 0)	25.5	4

