VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model

About

World models that forecast scene evolution by generating future video frames devote the bulk of their capacity to photometric details, yet the resulting predictions often remain geometrically inconsistent. We present VGGT-World, a geometry world model that side-steps video generation entirely and instead forecasts the temporal evolution of frozen geometry-foundation-model (GFM) features. Concretely, we repurpose the latent tokens of a frozen VGGT as the world state and train a lightweight temporal flow transformer to autoregressively predict their future trajectory. Two technical challenges arise in this high-dimensional (d=1024) feature space: (i) standard velocity-prediction flow matching collapses, and (ii) autoregressive rollout suffers from compounding exposure bias. We address the first with a clean-target (z-prediction) parameterization that yields a substantially higher signal-to-noise ratio, and the second with a two-stage latent flow-forcing curriculum that progressively conditions the model on its own partially denoised rollouts. Experiments on KITTI, Cityscapes, and TartanAir demonstrate that VGGT-World significantly outperforms the strongest baselines in depth forecasting while running 3.6-5 times faster with only 0.43B trainable parameters, establishing frozen GFM features as an effective and efficient predictive state for 3D world modeling.
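
The difference between velocity prediction and the clean-target (z-prediction) parameterization mentioned above can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a linear interpolation path running from noise at t=0 to the clean feature z at t=1, and the function names (`interpolate`, `velocity_target`, `velocity_from_z_prediction`, `z_prediction_loss`) are hypothetical. Under that assumption, a model that regresses z directly still defines a velocity field, which can be recovered from the prediction at sampling time.

```python
import numpy as np

def interpolate(z, noise, t):
    """Point on the assumed linear path: noise at t=0, clean target z at t=1."""
    return (1.0 - t) * noise + t * z

def velocity_target(z, noise):
    """Regression target under velocity prediction: d x_t / d t = z - noise."""
    return z - noise

def velocity_from_z_prediction(z_hat, x_t, t, eps=1e-5):
    """Recover the velocity implied by a clean-target prediction z_hat.

    From x_t = (1 - t) * noise + t * z it follows that
    z - noise = (z - x_t) / (1 - t); eps guards the division near t = 1.
    """
    return (z_hat - x_t) / np.maximum(1.0 - t, eps)

def z_prediction_loss(z_hat, z):
    """Mean squared error against the clean target itself."""
    return float(np.mean((z_hat - z) ** 2))

# Toy check with d = 1024 features, matching the dimension quoted above.
rng = np.random.default_rng(0)
d = 1024
z = rng.standard_normal((4, d))       # stand-in for clean latent tokens
noise = rng.standard_normal((4, d))   # Gaussian noise sample
t = np.full((4, 1), 0.3)              # one flow time per sample
x_t = interpolate(z, noise, t)

# A perfect z-prediction reproduces the velocity target exactly.
v_recovered = velocity_from_z_prediction(z, x_t, t)
velocities_match = bool(np.allclose(v_recovered, velocity_target(z, noise)))
perfect_loss = z_prediction_loss(z, z)
```

In this sketch the two parameterizations describe the same flow; only the regression target changes, and the abstract credits that change with the higher signal-to-noise ratio.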

Xiangyu Sun, Shijie Wang, Fengyi Zhang, Lin Liu, Caiyan Jia, Ziying Song, Zi Huang, Yadan Luo • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Depth Forecasting | Cityscapes short-term | Delta 1 Accuracy | 93.8 | 13 |
| Depth Forecasting | Cityscapes mid-term | Delta 1 | 87.3 | 13 |
| Geometry Generation | TartanAir 1-view | Accuracy | 1.1604 | 7 |
| Geometry Generation | TartanAir 2-views | Accuracy | 1.1604 | 7 |
| Depth Forecasting | KITTI 2011_09_26_drive_0002_sync | AbsRel (Short) | 5 | 6 |
| Depth Forecasting | KITTI 2011_10_03_drive_0047_sync | AbsRel (Short Range) | 0.064 | 6 |
| Depth Forecasting | KITTI Mean Dataset Avg | AbsRel (Short) | 6.5 | 6 |