
InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams

About

The grand vision of enabling persistent, large-scale 3D visual geometry understanding is shackled by the irreconcilable demands of scalability and long-term stability. While offline models like VGGT achieve inspiring geometry capability, their batch-based nature renders them irrelevant for live systems. Streaming architectures, though the intended solution for live operation, have proven inadequate. Existing methods either fail to support truly infinite-horizon inputs or suffer from catastrophic drift over long sequences. We shatter this long-standing dilemma with InfiniteVGGT, a causal visual geometry transformer that operationalizes the concept of a rolling memory through a bounded yet adaptive and perpetually expressive KV cache. Capitalizing on this, we devise a training-free, attention-agnostic pruning strategy that intelligently discards obsolete information, effectively "rolling" the memory forward with each new frame. Fully compatible with FlashAttention, InfiniteVGGT finally alleviates the compromise, enabling infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The ultimate test for such a system is its performance over a truly infinite horizon, a capability that has been impossible to rigorously validate due to the lack of extremely long-term, continuous benchmarks. To address this critical gap, we introduce the Long3D benchmark, which, for the first time, enables a rigorous evaluation of continuous 3D geometry estimation on sequences of about 10,000 frames. This provides the definitive evaluation platform for future research in long-term 3D geometry understanding. Code is available at: https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT
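To make the rolling-memory idea concrete, here is a minimal sketch of a bounded KV cache that prunes itself using the cached keys alone. The abstract does not state the actual pruning criterion, so everything here is an assumption for illustration: the class name `RollingKVCache`, the `budget` parameter, and the redundancy score (highest cosine similarity to any other cached key) are hypothetical and are not taken from the InfiniteVGGT code. What the sketch does share with the description above is that eviction never looks at attention weights, which is why a scheme of this kind can coexist with fused kernels such as FlashAttention that do not materialize the attention matrix.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class RollingKVCache:
    """Hypothetical bounded cache: holds at most `budget` key/value entries."""

    def __init__(self, budget):
        if budget < 1:
            raise ValueError("budget must be at least 1")
        self.budget = budget
        self.keys = []
        self.values = []

    def append(self, key, value):
        """Add one entry, then prune until the cache fits the budget again."""
        self.keys.append(list(key))
        self.values.append(value)
        while len(self.keys) > self.budget:
            self._evict_most_redundant()

    def _evict_most_redundant(self):
        # Assumed criterion: an entry is redundant if some other cached key
        # points in almost the same direction. Only keys are inspected,
        # never attention scores.
        def redundancy(i):
            return max(
                cosine(self.keys[i], self.keys[j])
                for j in range(len(self.keys))
                if j != i
            )

        # max() returns the first maximal index, so ties evict the older entry.
        victim = max(range(len(self.keys)), key=redundancy)
        del self.keys[victim]
        del self.values[victim]


# Usage: stream three "frames" through a cache with room for two entries.
cache = RollingKVCache(budget=2)
cache.append([1.0, 0.0], "frame0")
cache.append([0.0, 1.0], "frame1")
cache.append([1.0, 0.1], "frame2")  # nearly duplicates frame0's key
print(cache.values)
```

In this toy run the key for `frame0` is nearly parallel to the key for `frame2`, so the older of the two is dropped and memory stays at the fixed budget however long the stream runs. A real system would operate per layer and per head on tensors rather than on Python lists.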

Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, Zhipeng Zhang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Depth Estimation | BONN | Relative Error (Rel) | 0.063 | 103 |
| 3D Reconstruction | Neural RGB-D (NRGBD) | Acc Mean | 0.08 | 38 |
| Visual Odometry | TUM-RGBD | freiburg1/xyz Error | 0.177 | 34 |
| 3D Reconstruction | 7 Scenes | Accuracy Mean | 4.3 | 32 |
| 3D Scene Reconstruction | 7-Scenes (test) | Accuracy | 0.043 | 27 |
| Visual Odometry | KITTI | KITTI Seq 03 Error | 157.1 | 27 |
| Visual Odometry | EuRoC (test) | Error MH01 | 3.48 | 6 |
| 3D Reconstruction | Long3D Dormitory (4208 Frames) | Accuracy (Mean) | 1.438 | 3 |
| 3D Reconstruction | Long3D Badminton Court | Accuracy (Mean) | 1.843 | 3 |
| 3D Reconstruction | Long3D Classroom | Accuracy (Mean) | 35.7 | 3 |
Showing 10 of 12 rows
