Back to the Features: DINO as a Foundation for Video World Models

About

We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.

Federico Baldassarre, Marc Szafraniec, Basile Terver, Vasil Khalidov, Francisco Massa, Yann LeCun, Patrick Labatut, Maximilian Seitzer, Piotr Bojanowski • 2025
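
The planning idea in the abstract, simulating candidate trajectories in latent space and picking the best one, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the random-shooting planner, the function names `plan_random_shooting` and `toy_predictor`, the 2-D latent, and the Euclidean goal cost are not taken from the paper. In the actual system the latents would be DINOv2 features and the predictor would be the fine-tuned action-conditioned model.

```python
import numpy as np

def plan_random_shooting(predictor, z0, z_goal, horizon=5,
                         num_candidates=256, action_dim=2, rng=None):
    """Sample action sequences, roll each out in latent space, and
    return the sequence whose final latent is closest to z_goal."""
    rng = np.random.default_rng() if rng is None else rng
    # Candidate action sequences: (num_candidates, horizon, action_dim)
    actions = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    # Start every candidate rollout from the same initial latent
    z = np.repeat(z0[None, :], num_candidates, axis=0)
    for t in range(horizon):
        z = predictor(z, actions[:, t])  # one simulated step per candidate
    # Cost: Euclidean distance between final latent and goal latent
    costs = np.linalg.norm(z - z_goal[None, :], axis=1)
    best = int(np.argmin(costs))
    return actions[best], float(costs[best])

def toy_predictor(z, a):
    # Hypothetical stand-in for a learned action-conditioned predictor:
    # the latent is a 2-D position and the action is a small displacement.
    return z + 0.1 * a

rng = np.random.default_rng(0)
z0 = np.zeros(2)
z_goal = np.array([0.3, -0.2])
best_actions, best_cost = plan_random_shooting(toy_predictor, z0, z_goal, rng=rng)
```

Random shooting is only one of several ways to search over candidate trajectories; it is used here because it keeps the simulate-then-select structure easy to see.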

Related benchmarks

Task               Dataset                               Result                      Rank
Dense Forecasting  VSPW, short horizon (~0.2 s)          mIoU (best-of-20): 54       6
Dense Forecasting  VSPW, mid horizon (~0.6 s)            mIoU (best-of-20): 47.9     6
Dense Forecasting  Cityscapes, short horizon (~0.2 s)    mIoU (best-of-20): 62       6
Dense Forecasting  KITTI, short horizon (~0.2 s)         RMSE (best-of-20): 3.16     6
Dense Forecasting  Cityscapes, mid horizon (~0.6 s)      mIoU (best-of-20): 49.8     6
Dense Forecasting  KITTI, mid horizon (~0.6 s)           RMSE (best-of-20): 4.07     6
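
The "best-of-20" label on the metrics above is not defined on this page. One common reading, assumed here, is that 20 forecasts are sampled per clip and only the best-scoring one counts: the maximum for mIoU (higher is better) and the minimum for RMSE (lower is better). The sketch below shows that selection on made-up toy scores with 3 samples per clip. The helper name `best_of_k` and the per-clip averaging are assumptions, and the numbers are not benchmark results.

```python
import numpy as np

def best_of_k(scores, higher_is_better):
    """scores has shape (num_clips, k): one score per sampled forecast.
    Keep the best sample for each clip, then average over clips."""
    scores = np.asarray(scores, dtype=float)
    best = scores.max(axis=1) if higher_is_better else scores.min(axis=1)
    return float(best.mean())

# Toy values for illustration only: 2 clips, 3 sampled forecasts each.
miou_scores = np.array([[0.50, 0.62, 0.58],
                        [0.40, 0.35, 0.47]])
rmse_scores = np.array([[3.4, 3.1, 3.9],
                        [4.2, 4.5, 4.0]])

best_miou = best_of_k(miou_scores, higher_is_better=True)
best_rmse = best_of_k(rmse_scores, higher_is_better=False)
```

In practice mIoU is often accumulated over a whole dataset rather than averaged per clip, so this shows only the selection step, not a full evaluation protocol.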
