Geometric Context Transformer for Streaming 3D Reconstruction
About
Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, a task that demands geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map is its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable, efficient inference at around 20 FPS on 518 × 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach outperforms both existing streaming approaches and iterative optimization-based approaches.
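To make the three context types more concrete, here is a minimal bookkeeping sketch in Python. It is not the LingBot-Map implementation: the abstract does not specify how frames move between contexts, so the class name `StreamingContext`, its parameters (`num_anchor_frames`, `window_size`), and the eviction rule (frames leaving the window are reduced to a single summary token) are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StreamingContext:
    """Hypothetical bookkeeping for the three context types named in the abstract."""

    num_anchor_frames: int = 1  # assumed: the first frame(s) ground the coordinate system
    window_size: int = 4        # assumed: number of recent frames kept with dense tokens
    anchor: list = field(default_factory=list)   # (frame_id, dense_tokens)
    window: deque = field(default_factory=deque) # (frame_id, dense_tokens, summary_token)
    memory: list = field(default_factory=list)   # (frame_id, summary_token)

    def update(self, frame_id, dense_tokens, summary_token):
        # The earliest frames become the anchor context and are never evicted.
        if len(self.anchor) < self.num_anchor_frames:
            self.anchor.append((frame_id, dense_tokens))
            return
        # Later frames enter the sliding window with their dense tokens.
        self.window.append((frame_id, dense_tokens, summary_token))
        # Assumed eviction rule: the oldest window frame keeps only its summary token.
        if len(self.window) > self.window_size:
            old_id, _, old_summary = self.window.popleft()
            self.memory.append((old_id, old_summary))

    def context_frame_ids(self):
        # Which frames the next frame could attend to, grouped by context type.
        return {
            "anchor": [f for f, _ in self.anchor],
            "window": [f for f, _, _ in self.window],
            "memory": [f for f, _ in self.memory],
        }

    def num_context_tokens(self):
        # Dense tokens for anchor and window frames, one token per memory frame.
        dense = sum(len(t) for _, t in self.anchor)
        dense += sum(len(t) for _, t, _ in self.window)
        return dense + len(self.memory)


if __name__ == "__main__":
    ctx = StreamingContext(num_anchor_frames=1, window_size=3)
    for i in range(10):
        # Placeholder tokens: 8 dense tokens per frame, 1 summary token.
        ctx.update(i, dense_tokens=[0.0] * 8, summary_token=0.0)
    print(ctx.context_frame_ids())
    print(ctx.num_context_tokens())
```

In this sketch the dense part of the state is bounded by the anchor and window sizes, while the memory grows by one token per evicted frame. Whether the actual model bounds or compresses its trajectory memory further is not stated in the abstract.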
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Reconstruction | 7 Scenes | -- | -- | 94 |
| 3D Reconstruction | NRGBD | -- | -- | 44 |
| Pose Estimation | ETH3D | AUC @ Threshold 3 | 0.2779 | 41 |
| 3D Reconstruction | ETH3D | F1 Score | 98.98 | 25 |
| Camera pose estimation | Oxford Spires sparse setting | AUC@15 | 61.64 | 18 |
| Pose and trajectory estimation | 7 Scenes | AUC3 | 12.63 | 9 |
| Pose and trajectory estimation | Tanks&Temples | AUC3 | 45.8 | 9 |