MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM
About
Feed-forward geometric foundation models can infer dense point clouds and camera motion directly from RGB streams, providing priors for monocular SLAM. However, their predictions are often view-dependent and noisy: geometry can vary across viewpoints and under image transformations, and local metric properties may drift between frames. We present MonoEM-GS, a monocular mapping pipeline that integrates such geometric predictions into a global Gaussian Splatting representation while explicitly addressing these inconsistencies. MonoEM-GS couples Gaussian Splatting with an Expectation-Maximization formulation to stabilize geometry, and employs ICP-based alignment for monocular pose estimation. Beyond geometry, MonoEM-GS parameterizes Gaussians with multi-modal features, enabling in-place open-set segmentation and other downstream queries directly on the reconstructed map. We evaluate MonoEM-GS on 7-Scenes, TUM RGB-D, and Replica, and compare against recent baselines.
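The paper does not spell out its EM formulation here, but the general idea of stabilizing noisy per-frame point predictions against a set of Gaussians can be sketched as a standard EM loop: an E-step that soft-assigns predicted points to Gaussians by responsibility, and an M-step that re-estimates Gaussian centers from those responsibilities. The function below is a minimal, illustrative sketch with isotropic Gaussians and a fixed bandwidth `sigma`; all names and the simplified update are assumptions, not the paper's actual algorithm.

```python
import numpy as np

def em_refine_gaussians(points, means, sigma=0.05, iters=5):
    """Toy EM refinement of Gaussian centers against noisy points.

    points : (N, 3) noisy point predictions (e.g. from a feed-forward model)
    means  : (K, 3) initial Gaussian centers
    sigma  : assumed isotropic std dev (hypothetical fixed bandwidth)

    E-step: responsibilities from squared point-to-center distances.
    M-step: each center becomes the responsibility-weighted mean.
    """
    for _ in range(iters):
        # E-step: log-weights proportional to an isotropic Gaussian kernel
        d2 = ((points[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        logw = -d2 / (2.0 * sigma ** 2)
        logw -= logw.max(axis=1, keepdims=True)  # numerical stabilization
        resp = np.exp(logw)
        resp /= resp.sum(axis=1, keepdims=True)  # normalize per point

        # M-step: responsibility-weighted update of each Gaussian center
        means = (resp.T @ points) / resp.sum(axis=0)[:, None]
    return means
```

In this toy setting, noisy points drawn around well-separated centers pull the initial (offset) Gaussian means onto the underlying cluster centroids within a few iterations; the real system would additionally handle covariances, opacities, and view-dependent noise.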
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Camera Tracking | TUM RGB-D | ATE RMSE (cm) | 12 | 18 |
| Dense Reconstruction | TUM RGB-D | Completion Error | 0.15 | 9 |
| 3D Semantic Segmentation | Replica 3D | mIoU | 31.5 | 5 |
| Mapping | 7 Scenes | Accuracy | 7 | 5 |
| Localization | 7 Scenes | ATE RMSE | 0.08 | 5 |
| Trajectory Estimation | Replica 3D | ATE RMSE | 13.1 | 3 |