OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer

About

Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy. Project page: https://vaisr.github.io/OVGGT/ Code: https://github.com/VAISR/OVGGT
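The abstract does not spell out the eviction rule, so the following is only a minimal sketch of the general idea of a fixed-budget KV cache with protected tokens, not the authors' implementation. It assumes each cached token carries a single scalar importance score (standing in for the FFN residual magnitude mentioned above) and that a set of anchor token ids is given. The names `evict_to_budget` and `run_stream`, the score values, and the choice of anchors are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def evict_to_budget(
    cache: Dict[int, float], budget: int, protected: Set[int]
) -> Dict[int, float]:
    """Keep at most `budget` tokens: all protected ones, then the highest-scoring rest.

    `cache` maps a token id to a scalar importance score (an assumed stand-in
    for an FFN residual magnitude). `protected` holds anchor token ids that
    must never be evicted (hypothetical selection).
    """
    if len(cache) <= budget:
        return dict(cache)
    kept = {i for i in cache if i in protected}
    if len(kept) > budget:
        raise ValueError("budget is smaller than the number of protected tokens")
    # Rank unprotected tokens by score, highest first, and fill the remaining slots.
    rest = sorted(
        (i for i in cache if i not in protected),
        key=lambda i: cache[i],
        reverse=True,
    )
    kept.update(rest[: budget - len(kept)])
    return {i: cache[i] for i in sorted(kept)}


def run_stream(
    frames: List[List[float]], budget: int, protected: Set[int]
) -> Tuple[Dict[int, float], List[int]]:
    """Append each frame's token scores to the cache, then evict back to the budget."""
    cache: Dict[int, float] = {}
    sizes: List[int] = []
    next_id = 0
    for scores in frames:
        for s in scores:
            cache[next_id] = s
            next_id += 1
        cache = evict_to_budget(cache, budget, protected)
        sizes.append(len(cache))  # never exceeds `budget`
    return cache, sizes


if __name__ == "__main__":
    frames = [[0.1, 0.9], [0.5, 0.7], [0.2, 0.8]]  # made-up scores, two tokens per frame
    final_cache, sizes = run_stream(frames, budget=3, protected={0})
    print(sorted(final_cache), sizes)
```

In this toy run, token 0 has the lowest score but survives every eviction because it is protected, while the cache size stops growing once it reaches the budget. A real system would evict key and value tensors per layer and head rather than scalar entries in a dictionary.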

Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng, Yung-Yao Chen • 2026

Related benchmarks

Task	Dataset	Result
Camera pose estimation	TUM-dynamic	ATE 0.014	205
Video Depth Estimation	KITTI	Abs Rel 0.128	148
Camera pose estimation	ScanNet	--	133
Video Depth Estimation	BONN	AbsRel 5.5	131
3D Reconstruction	7 Scenes	Accuracy Median 1.4	128
3D Reconstruction	Neural RGB-D (NRGBD)	Acc Mean 0.054	88
3D Reconstruction	ETH3D Outdoor full sequences	Accuracy (Mean) 0.628	7
3D Reconstruction	Long3D Ultra-Long full sequences	Accuracy Error (Mean) 2.449	6
