
Streaming 4D Visual Geometry Transformer

About

Perceiving and reconstructing 4D spatial-temporal geometry from videos is a fundamental yet challenging computer vision task. To facilitate interactive and real-time applications, we propose a streaming 4D visual geometry transformer that shares a similar philosophy with autoregressive large language models. We explore a simple and efficient design and employ a causal transformer architecture to process the input sequence in an online manner. We use temporal causal attention and cache the historical keys and values as implicit memory to enable efficient, streaming, long-term 4D reconstruction. This design supports real-time 4D reconstruction by incrementally integrating historical information while maintaining high-quality spatial consistency. For efficient training, we propose to distill knowledge from the dense bidirectional visual geometry grounded transformer (VGGT) into our causal model. For inference, our model supports migrating optimized efficient attention operators (e.g., FlashAttention) from the field of large language models. Extensive experiments on various 4D geometry perception benchmarks demonstrate that our model increases inference speed in online scenarios while maintaining competitive performance, paving the way for scalable and interactive 4D vision systems. Code is available at: https://github.com/wzzheng/StreamVGGT.

Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu • 2025
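
The abstract describes temporal causal attention with cached keys and values as implicit memory. The following is a minimal, illustrative sketch of that general idea, not the authors' implementation: it uses a single attention head, plain Python lists, and hypothetical names (`attend`, `StreamingCausalAttention`, `full_causal_attention`). It assumes each frame contributes a few tokens, that tokens attend to every token of the current and earlier frames, and it omits the learned projections, multiple heads, and fused kernels such as FlashAttention.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over lists of keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[j] for e, v in zip(exps, values)) for j in range(dim)]


class StreamingCausalAttention:
    """Hypothetical single-head layer that keeps past keys/values as a cache."""

    def __init__(self):
        self.cached_keys = []
        self.cached_values = []

    def step(self, frame_q, frame_k, frame_v):
        # Append the new frame's keys/values, then let its queries attend to
        # everything seen so far (earlier frames plus the current frame).
        self.cached_keys.extend(frame_k)
        self.cached_values.extend(frame_v)
        return [attend(q, self.cached_keys, self.cached_values) for q in frame_q]


def full_causal_attention(frames_q, frames_k, frames_v):
    """Reference: rebuild the visible keys/values from scratch for every frame."""
    outputs = []
    for t in range(len(frames_q)):
        keys = [k for frame in frames_k[: t + 1] for k in frame]
        values = [v for frame in frames_v[: t + 1] for v in frame]
        outputs.append([attend(q, keys, values) for q in frames_q[t]])
    return outputs
```

Calling `step` once per incoming frame should give the same outputs as `full_causal_attention` on the whole sequence, while only the new frame's queries are processed at each step. The cache grows with the number of frames, which is the usual memory trade-off of this kind of design.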

Related benchmarks

Task                    Dataset        Metric                Result  Rank
Video Depth Estimation  Sintel         Relative Error (Rel)  0.323   109
Video Depth Estimation  BONN           Relative Error (Rel)  0.059   103
Camera pose estimation  Sintel         ATE                   0.251   92
Camera pose estimation  ScanNet        ATE RMSE (Avg.)       0.161   61
Video Depth Estimation  Sintel (test)  Delta 1 Accuracy      65.7    57
Video Depth Estimation  Bonn (test)    Abs Rel               0.059   37
Visual Odometry         TUM-RGBD       freiburg1/xyz Error   0.185   34
3D Reconstruction       7 Scenes       --                    --      32
Visual Odometry         KITTI          KITTI Seq 03 Error    164.8   27
Video Depth Estimation  KITTI (test)   Delta1                72.1    25
Showing 10 of 38 rows
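
The depth results listed above include relative error (Abs Rel) and Delta 1 accuracy. As a rough guide to reading them, here is a small sketch of the commonly used definitions: Abs Rel is the mean of |pred - gt| / gt, and Delta 1 is the fraction of pixels where max(pred/gt, gt/pred) is below 1.25. The function names are hypothetical, and this is an assumption about the general convention rather than the exact protocol behind these rows; benchmark protocols often align scale (and sometimes shift) before computing the metrics, and the Delta 1 values here appear to be reported as percentages.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels whose ratio max(pred/gt, gt/pred) is below the threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)
```

Lower is better for Abs Rel and higher is better for Delta 1; multiplying `delta1` by 100 gives a percentage.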
