OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
About
Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.
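The abstract describes the caching idea only at a high level, so the following is a minimal sketch of a fixed-budget KV cache with score-based eviction and protected anchors, not the authors' implementation. The class `BoundedKVCache`, the helper `residual_magnitude`, the use of an L2 norm as the importance score, and the rule for which tokens count as anchors are all assumptions made for illustration. The one property it is meant to show is that the cache size stays constant however many tokens are streamed.

```python
import math
from dataclasses import dataclass, field
from typing import List


def residual_magnitude(ffn_residual: List[float]) -> float:
    """L2 norm of a token's FFN residual (assumed importance score)."""
    return math.sqrt(sum(x * x for x in ffn_residual))


@dataclass
class BoundedKVCache:
    """Toy cache holding at most `budget` token entries (hypothetical API)."""
    budget: int
    keys: List[List[float]] = field(default_factory=list)
    values: List[List[float]] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)
    anchors: List[bool] = field(default_factory=list)  # True = never evicted first

    def append(self, key, value, score, is_anchor=False):
        """Add one token, then evict until the cache fits the budget."""
        self.keys.append(key)
        self.values.append(value)
        self.scores.append(score)
        self.anchors.append(is_anchor)
        self._evict()

    def _evict(self):
        while len(self.keys) > self.budget:
            # Only non-anchor tokens are eviction candidates.
            candidates = [i for i, a in enumerate(self.anchors) if not a]
            if candidates:
                # Drop the candidate with the smallest importance score.
                victim = min(candidates, key=lambda i: self.scores[i])
            else:
                # Assumed fallback: everything is an anchor, drop the oldest.
                victim = 0
            for buf in (self.keys, self.values, self.scores, self.anchors):
                del buf[victim]


# Stream 1000 tokens through a cache with room for 8.
cache = BoundedKVCache(budget=8)
cache.append([0.0], [0.0], residual_magnitude([0.1]), is_anchor=True)
for t in range(1, 1000):
    cache.append([float(t)], [float(t)], residual_magnitude([float(t % 7)]))
```

In this sketch the score is computed from the token's own residual rather than from attention weights, so no attention matrix has to be materialized to decide what to evict. That is presumably what lets such a scheme coexist with fused kernels like FlashAttention, though the source does not spell out the mechanism.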
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Camera pose estimation | TUM-dynamic | ATE 0.014 | 163 |
| Video Depth Estimation | KITTI | Abs Rel 0.128 | 126 |
| Camera pose estimation | ScanNet | -- | 119 |
| Video Depth Estimation | BONN | AbsRel 5.5 | 116 |
| 3D Reconstruction | 7 Scenes | Accuracy Median 1.4 | 94 |
| 3D Reconstruction | Neural RGB-D (NRGBD) | Acc Mean 0.054 | 88 |
| 3D Reconstruction | ETH3D Outdoor full sequences | Accuracy (Mean) 0.628 | 7 |
| 3D Reconstruction | Long3D Ultra-Long full sequences | Accuracy Error (Mean) 2.449 | 6 |