InstantSfM: Towards GPU-Native SfM for the Deep Learning Era
About
Structure-from-Motion (SfM) is a fundamental technique for recovering camera poses and scene structure from multi-view imagery, serving as a critical upstream component for applications ranging from 3D reconstruction to modern neural scene representations such as 3D Gaussian Splatting. However, most mature SfM systems remain CPU-centric and are built upon traditional optimization toolchains, creating a growing mismatch with modern GPU-based, learning-driven pipelines and limiting scalability on large-scale scenes. While recent advances in GPU-accelerated bundle adjustment (BA) have demonstrated the potential of parallel sparse optimization, extending these techniques into a complete global SfM system remains challenging due to unresolved issues in metric scale recovery and numerical robustness. In this paper, we implement a fully GPU-based, PyTorch-compatible global SfM system, named InstantSfM, that integrates seamlessly with modern learning pipelines. InstantSfM embeds metric depth priors directly into both global positioning and BA through a depth-constrained Jacobian structure, thereby resolving scale ambiguity within the optimization framework. To ensure numerical stability, we explicitly filter under-constrained variables out of the Jacobian matrix in a GPU-friendly manner. Extensive experiments on diverse datasets demonstrate that InstantSfM achieves state-of-the-art efficiency while maintaining reconstruction accuracy comparable to both established classical pipelines and recent learning-based methods, with up to $\sim 40\times$ speedup over COLMAP on large-scale scenes.
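To make the idea of filtering under-constrained variables concrete, here is a minimal sketch. It is not the InstantSfM implementation: the abstract does not state the filtering criterion, so this assumes a simple minimum-view rule, in which a 3D point observed by fewer than two distinct cameras is dropped before the problem is assembled. The function name `filter_underconstrained` and the parameter `min_views` are hypothetical, and the code is plain Python rather than GPU code.

```python
from collections import Counter


def filter_underconstrained(observations, min_views=2):
    """Drop 3D points seen by fewer than `min_views` distinct cameras.

    observations: list of (camera_index, point_index) pairs, one per 2D measurement.
    Returns (kept_observations, old_to_new), where old_to_new maps each surviving
    original point index to a compact index starting at 0.

    Assumption: the minimum-view rule is illustrative only; the source does not
    specify which criterion the system actually uses.
    """
    # Deduplicate pairs so repeated measurements in one camera count once.
    views = Counter(pt for _cam, pt in set(observations))
    keep = sorted(pt for pt, n in views.items() if n >= min_views)
    old_to_new = {pt: i for i, pt in enumerate(keep)}
    # Keep only measurements of surviving points, re-indexed compactly.
    kept = [(cam, old_to_new[pt]) for cam, pt in observations if pt in old_to_new]
    return kept, old_to_new


# Point 1 is seen by a single camera, so it is removed.
obs = [(0, 0), (1, 0), (0, 1), (2, 2), (1, 2), (0, 2)]
kept, mapping = filter_underconstrained(obs)
```

In a sparse Jacobian, removing a point corresponds to dropping its block of columns along with the rows of its measurements. On a GPU, the same selection would typically be expressed as a boolean mask over point indices rather than a Python loop.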
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Novel View Synthesis | Mip-NeRF360 | PSNR | 28.43 | 138 |
| Structure-from-Motion | DTU | PSNR | 30.83 | 30 |
| Novel View Synthesis | Mip-NeRF 360 garden | SSIM | 0.869 | 14 |
| Novel View Synthesis | Mip-NeRF 360 stump | SSIM | 0.711 | 14 |
| Camera pose estimation | 7-Scenes (500 Images) | RRA@30 | 100 | 13 |
| Novel View Synthesis | MipNeRF360 Room | PSNR | 31.04 | 12 |
| Novel View Synthesis | Mip-NeRF 360 Synthesized Varying Exposure (bicycle) | PSNR | 25.73 | 9 |
| Novel View Synthesis | Mip-NeRF360 bonsai | PSNR | 32.06 | 7 |
| Novel View Synthesis | Mip-NeRF360 counter | PSNR | 29.23 | 7 |
| Novel View Synthesis | Mip-NeRF360 kitchen | PSNR | 27.79 | 7 |