M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM
About
Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments a Multi-view foundation model with a dedicated Matching head to produce fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM system. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.
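The ATE RMSE figure quoted above is the Absolute Trajectory Error after aligning the estimated camera trajectory to ground truth. A minimal sketch of the standard metric (Umeyama similarity alignment followed by an RMSE over position errors), not the authors' evaluation code:

```python
import numpy as np

def ate_rmse(gt: np.ndarray, est: np.ndarray) -> float:
    """ATE RMSE: align est to gt with a Sim(3) (Umeyama) fit, then RMSE.

    gt, est: (N, 3) arrays of time-synchronized camera positions.
    """
    gt, est = np.asarray(gt, float), np.asarray(est, float)
    mu_g, mu_e = gt.mean(0), est.mean(0)
    Y, X = gt - mu_g, est - mu_e          # centered target / source
    H = X.T @ Y                            # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])             # reflection guard
    R = Vt.T @ D @ U.T                     # rotation aligning est -> gt
    s = np.trace(D @ np.diag(S)) / (X ** 2).sum()  # optimal scale
    t = mu_g - s * R @ mu_e                # optimal translation
    err = gt - (s * (R @ est.T).T + t)     # residual per frame
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))
```

Scale alignment matters for monocular SLAM, where the trajectory is recovered only up to an unknown global scale.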
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Novel View Synthesis | ScanNet++ | PSNR | 27.789 | 67 |
| Novel View Synthesis | Waymo | PSNR | 28.346 | 28 |
| Appearance Rendering | ScanNet V2 | PSNR | 27.08 | 19 |
| Appearance Rendering | FAST-LIVO2 | PSNR | 25.48 | 17 |
| Appearance Rendering | Waymo | PSNR | 28.94 | 14 |
| Appearance Rendering | VR-NeRF | PSNR | 29.64 | 14 |
| Appearance Rendering | KITTI | PSNR | 22.47 | 14 |
| Appearance Rendering | ScanNet++ | PSNR | 28.82 | 14 |
| Tracking | Waymo | ATE RMSE (m) | 0.773 | 7 |
| Tracking | KITTI | ATE RMSE (m) | 0.89 | 7 |
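The rendering benchmarks above report PSNR (peak signal-to-noise ratio) in dB. A minimal sketch of the standard definition, assuming images normalized to [0, 1] (not the benchmarks' exact evaluation harness):

```python
import numpy as np

def psnr(ref: np.ndarray, img: np.ndarray, max_val: float = 1.0) -> float:
    """PSNR in dB: 10 * log10(max_val^2 / MSE) over all pixels/channels."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(img, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```

On this scale the 2.11 dB gap quoted above corresponds to roughly a 1.6x reduction in mean squared rendering error.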