Light3R-SfM: Towards Feed-forward Structure-from-Motion
About
We present Light3R-SfM, a feed-forward, end-to-end learnable framework for efficient large-scale Structure-from-Motion (SfM) from unconstrained image collections. Unlike existing SfM solutions that rely on costly matching and global optimization to achieve accurate 3D reconstructions, Light3R-SfM addresses this limitation through a novel latent global alignment module. This module replaces traditional global optimization with a learnable attention mechanism, effectively capturing multi-view constraints across images for robust and precise camera pose estimation. Light3R-SfM constructs a sparse scene graph via a retrieval-score-guided shortest path tree, which dramatically reduces memory usage and computational overhead compared to the naive approach. Extensive experiments demonstrate that Light3R-SfM achieves competitive accuracy while significantly reducing runtime, making it well suited to real-world 3D reconstruction tasks under runtime constraints. This work pioneers a data-driven, feed-forward SfM approach, paving the way toward scalable, accurate, and efficient 3D reconstruction in the wild.
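To make the scene-graph idea concrete, the following is a minimal sketch of a shortest path tree built from pairwise retrieval scores. It is not the authors' implementation. The function name `shortest_path_tree`, the edge cost `1 - similarity`, and the choice of root (the image with the highest total similarity to all others) are assumptions made for illustration. The input `sim` is assumed to be a symmetric matrix of retrieval scores in [0, 1].

```python
import heapq


def shortest_path_tree(sim):
    """Return (root, edges) of a shortest path tree over a dense similarity matrix.

    Hypothetical helper: `sim[i][j]` is an assumed retrieval score in [0, 1].
    """
    n = len(sim)
    # Assumption: root at the image that is most similar to all the others.
    root = max(range(n), key=lambda i: sum(sim[i][j] for j in range(n) if j != i))

    dist = [float("inf")] * n
    parent = [None] * n
    done = [False] * n
    dist[root] = 0.0
    heap = [(0.0, root)]

    # Standard Dijkstra over the fully connected graph.
    while heap:
        d, u = heapq.heappop(heap)
        if done[u]:
            continue
        done[u] = True
        for v in range(n):
            if v == u or done[v]:
                continue
            # Assumption: a high retrieval score means a cheap edge.
            new_dist = d + (1.0 - sim[u][v])
            if new_dist < dist[v]:
                dist[v] = new_dist
                parent[v] = u
                heapq.heappush(heap, (new_dist, v))

    # Every non-root image keeps exactly one edge, the one to its parent.
    edges = [(parent[v], v) for v in range(n) if parent[v] is not None]
    return root, edges


# Toy example with four images whose scores roughly form a chain 0-1-2-3.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.8, 0.1],
    [0.2, 0.8, 1.0, 0.7],
    [0.1, 0.1, 0.7, 1.0],
]
root, edges = shortest_path_tree(sim)
```

For N images the resulting tree has N - 1 edges, whereas a fully connected graph has N(N - 1)/2, which is where the savings in pairwise processing would come from under these assumptions.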
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Structure-from-Motion | Tanks&Temples | Registration Score | 1 | 15 |
| Camera pose estimation | CO3D 10-view v2 | RRA@15 | 94.7 | 12 |
| Multi-View Pose Estimation | Tanks&Temples 25-view | RRA@5 | 50.9 | 9 |
| Multi-View Pose Estimation | Tanks&Temples 50-view | RRA@5 | 52.5 | 9 |
| Multi-View Pose Estimation | Tanks&Temples 100-view | RRA@5 | 54.3 | 9 |
| Multi-View Pose Estimation | Tanks&Temples 200-view | RRA@5 | 52.4 | 9 |
| Multi-View Pose Estimation | Tanks&Temples (full sequence) | Registration Error | 100 | 8 |
| Camera pose estimation | CO3D 2-view v2 | RRA@15 | 95.5 | 4 |
| Camera pose estimation | Waymo Open Dataset (val) | RRA@5 | 78.3 | 3 |
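
The RRA entries in the table report relative rotation accuracy at a threshold in degrees, commonly defined as the percentage of image pairs whose relative rotation error falls below that threshold. The sketch below illustrates that common definition only. The helper names (`rot_z`, `matmul_t`, `rotation_angle_deg`, `rra`) are hypothetical, and the exact evaluation protocol behind the numbers above may differ, for example in how pairs are sampled.

```python
import math
from itertools import combinations


def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (used only to build toy data)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul_t(a, b):
    """Return a @ b^T for 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_angle_deg(r):
    """Rotation angle of a 3x3 rotation matrix in degrees, from its trace."""
    cos_angle = (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def rra(pred, gt, tau_deg):
    """Percentage of image pairs with relative rotation error below tau_deg."""
    errors = []
    for i, j in combinations(range(len(gt)), 2):
        rel_pred = matmul_t(pred[j], pred[i])  # predicted R_j R_i^T
        rel_gt = matmul_t(gt[j], gt[i])        # ground-truth R_j R_i^T
        # Angle of the rotation that maps the ground-truth relative
        # rotation onto the predicted one.
        errors.append(rotation_angle_deg(matmul_t(rel_pred, rel_gt)))
    return 100.0 * sum(e < tau_deg for e in errors) / len(errors)


# Toy example: three cameras rotating about the z-axis.
gt = [rot_z(0), rot_z(30), rot_z(60)]
pred = [rot_z(0), rot_z(32), rot_z(80)]
rra_at_5 = rra(pred, gt, 5)
rra_at_25 = rra(pred, gt, 25)
```

Because the metric uses only relative rotations between pairs, it does not depend on the global coordinate frame in which the predicted poses are expressed.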