Back to the Features: DINO as a Foundation for Video World Models

About

We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.

Federico Baldassarre, Marc Szafraniec, Basile Terver, Vasil Khalidov, Francisco Massa, Yann LeCun, Patrick Labatut, Maximilian Seitzer, Piotr Bojanowski • 2025
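
The planning use case mentioned in the abstract, simulating candidate trajectories in latent space and picking the best one, can be illustrated with a minimal random-shooting sketch. Everything here is an assumption made for illustration: `predict_next` is a hypothetical toy stand-in for an action-conditioned predictor (it is not the DINO-world model), latents are short lists of floats rather than DINOv2 features, and random shooting is only one of several planners that could be used. The abstract does not say which planner the authors use.

```python
import random


def predict_next(latent, action):
    """Hypothetical stand-in for an action-conditioned latent predictor.

    Toy dynamics (an assumption): the latent moves by the action vector.
    """
    return [z + a for z, a in zip(latent, action)]


def rollout(latent, actions):
    """Simulate one candidate action sequence entirely in latent space."""
    for action in actions:
        latent = predict_next(latent, action)
    return latent


def squared_distance(a, b):
    """Squared Euclidean distance between two latent vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def plan(start, goal, horizon=5, num_candidates=256, seed=0):
    """Random-shooting planner.

    Samples action sequences uniformly in [-1, 1], rolls each one out with
    the predictor, and keeps the sequence whose final predicted latent is
    closest to the goal latent.
    """
    rng = random.Random(seed)
    dim = len(start)
    best_actions, best_cost = None, float("inf")
    for _ in range(num_candidates):
        actions = [
            [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(horizon)
        ]
        cost = squared_distance(rollout(start, actions), goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


if __name__ == "__main__":
    start, goal = [0.0, 0.0], [2.0, -1.0]
    actions, cost = plan(start, goal)
    print(f"best cost: {cost:.3f}, first action: {actions[0]}")
```

In a real setting the start and goal latents would come from encoding observations with the image encoder, and the cost could be any distance in feature space. The squared Euclidean distance above is simply a convenient choice for the toy example.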

Related benchmarks

Task	Dataset	Metric	Result
Dense Forecasting	VSPW Short horizon, ~0.2s	mIoU (best-of-20)	54	6
Dense Forecasting	VSPW Mid horizon, ~0.6s	mIoU (best-of-20)	47.9	6
Dense Forecasting	Cityscapes Short horizon, ~0.2s	mIoU (best-of-20)	62	6
Dense Forecasting	KITTI Short horizon, ~0.2s	RMSE (best-of-20)	3.16	6
Dense Forecasting	Cityscapes Mid horizon, ~0.6s	mIoU (best-of-20)	49.8	6
Dense Forecasting	KITTI Mid horizon, ~0.6s	RMSE (best-of-20)	4.07	6
