Speed3R: Sparse Feed-forward 3D Reconstruction Models
About
Recent feed-forward 3D reconstruction models accelerate reconstruction by jointly inferring dense geometry and camera poses in a single pass, but their reliance on dense attention imposes quadratic complexity in the number of image tokens, a computational bottleneck that severely limits inference speed. To address this, we introduce Speed3R, an end-to-end trainable model inspired by the core principle of Structure-from-Motion: a sparse set of keypoints is sufficient for robust pose estimation. Speed3R features a dual-branch attention mechanism in which a compression branch creates a coarse contextual prior that guides a selection branch, which performs fine-grained attention only on the most informative image tokens. This strategy mimics the efficiency of traditional keypoint matching and achieves a 12.4x inference speedup on 1000-view sequences, with a minimal, controlled trade-off in geometric accuracy. Validated on standard benchmarks with both VGGT and $\pi^3$ backbones, our method delivers high-quality reconstructions at a fraction of the computational cost, paving the way for efficient large-scale scene modeling.
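The dual-branch idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the mean-pooling used for compression, the block-wise top-k selection rule, and the names `dual_branch_attention`, `block_size`, and `top_k` are all assumptions chosen for illustration. A coarse branch scores pooled key blocks, and a fine branch then attends only to the tokens inside the highest-scoring blocks.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dual_branch_attention(q, k, v, block_size=4, top_k=2):
    """Illustrative sparse attention (hypothetical, not the paper's code).

    q: (n_q, d) queries; k, v: (n_kv, d) keys/values.
    n_kv must be divisible by block_size.
    Returns (coarse_out, fine_out, selected_block_indices).
    """
    n_q, d = q.shape
    n_kv = k.shape[0]
    assert n_kv % block_size == 0, "n_kv must be a multiple of block_size"
    n_blocks = n_kv // block_size
    scale = 1.0 / np.sqrt(d)

    # Compression branch: mean-pool each block of keys/values (assumed
    # pooling) and attend over the resulting coarse tokens.
    k_blocks = k.reshape(n_blocks, block_size, d)
    v_blocks = v.reshape(n_blocks, block_size, d)
    k_coarse = k_blocks.mean(axis=1)                      # (n_blocks, d)
    v_coarse = v_blocks.mean(axis=1)                      # (n_blocks, d)
    coarse_attn = softmax((q @ k_coarse.T) * scale)       # (n_q, n_blocks)
    coarse_out = coarse_attn @ v_coarse                   # (n_q, d)

    # Selection branch: use the coarse scores as a prior and keep only
    # the top_k blocks per query.
    top_idx = np.argsort(-coarse_attn, axis=-1)[:, :top_k]  # (n_q, top_k)
    k_sel = k_blocks[top_idx].reshape(n_q, top_k * block_size, d)
    v_sel = v_blocks[top_idx].reshape(n_q, top_k * block_size, d)

    # Fine-grained attention restricted to the selected tokens.
    fine_scores = np.einsum("qd,qnd->qn", q, k_sel) * scale
    fine_attn = softmax(fine_scores)                      # (n_q, top_k*block_size)
    fine_out = np.einsum("qn,qnd->qd", fine_attn, v_sel)  # (n_q, d)

    return coarse_out, fine_out, top_idx
```

In this sketch each query compares against `n_blocks` coarse tokens plus `top_k * block_size` fine tokens instead of all `n_kv` keys, which is where the savings would come from. How the two branch outputs are merged is not stated in the abstract, so the function returns them separately.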
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Camera pose estimation | TUM-dynamic | ATE | 0.0193 | 205 |
| Point Map Estimation | 7 Scenes | Accuracy (Mean) | 1.2 | 69 |
| Relative Pose Estimation | ScanNet 1500 pairs (test) | AUC@5° | 37.02 | 56 |
| Camera pose estimation | RealEstate10K | AUC@30 | 74.81 | 46 |
| Pose Estimation | RE10K | -- | -- | 35 |
| Point Map Estimation | NRGBD | Mean Accuracy | 0.0208 | 32 |
| Pose Estimation | CO3D v2 | AUC@30 | 89.41 | 19 |
| Point Map Estimation | DTU (test) | Accuracy (Mean) | 1.175 | 15 |
| Camera pose estimation | 7 Scenes | ATE | 0.0591 | 14 |
| Camera pose estimation | Neural RGB-D | ATE | 0.0391 | 14 |