Back to the Features: DINO as a Foundation for Video World Models
About
We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.
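To make the planning idea concrete, here is a minimal sketch of choosing actions by simulating candidate trajectories in latent space. It is an illustration under stated assumptions, not the DINO-world implementation: `toy_predictor` is a hypothetical linear stand-in for the fine-tuned action-conditioned predictor, the latent vectors are tiny NumPy arrays rather than DINOv2 patch features, and random shooting is just one simple planner that fits the description.

```python
import numpy as np


def toy_predictor(z, a):
    """Hypothetical stand-in for an action-conditioned latent predictor.

    A real model would map (past latents, action) to the next latent;
    here the dynamics are a fixed linear rule so the example runs anywhere.
    """
    return 0.9 * z + a


def plan_random_shooting(z0, z_goal, horizon=5, n_candidates=256, seed=0):
    """Sample action sequences, roll each out in latent space, keep the best."""
    rng = np.random.default_rng(seed)
    dim = z0.shape[0]
    # Candidate action sequences with shape (n_candidates, horizon, dim).
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, dim))
    # Roll out all candidates in parallel from the same starting latent.
    z = np.tile(z0, (n_candidates, 1))
    for t in range(horizon):
        z = toy_predictor(z, actions[:, t])
    # Score each candidate by the distance of its final latent to the goal latent.
    costs = np.linalg.norm(z - z_goal, axis=1)
    best = int(np.argmin(costs))
    return actions[best], float(costs[best])


z0 = np.zeros(2)                 # assumed encoding of the current observation
z_goal = np.array([1.0, -1.0])   # assumed encoding of the goal observation
best_actions, best_cost = plan_random_shooting(z0, z_goal)

# Cost of taking no action at all, for comparison.
baseline_cost = float(np.linalg.norm(0.9 ** 5 * z0 - z_goal))
print(best_actions.shape, best_cost < baseline_cost)
```

In this sketch only the first action of the best sequence would typically be executed before replanning from the next observation; more sample-efficient planners, such as the cross-entropy method, follow the same simulate-and-score loop.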
Related benchmarks
| Task | Dataset | Horizon | Metric | Value | Rank |
|---|---|---|---|---|---|
| Dense Forecasting | VSPW | Short, ~0.2 s | mIoU (best-of-20) | 54 | 6 |
| Dense Forecasting | VSPW | Mid, ~0.6 s | mIoU (best-of-20) | 47.9 | 6 |
| Dense Forecasting | Cityscapes | Short, ~0.2 s | mIoU (best-of-20) | 62 | 6 |
| Dense Forecasting | KITTI | Short, ~0.2 s | RMSE (best-of-20) | 3.16 | 6 |
| Dense Forecasting | Cityscapes | Mid, ~0.6 s | mIoU (best-of-20) | 49.8 | 6 |
| Dense Forecasting | KITTI | Mid, ~0.6 s | RMSE (best-of-20) | 4.07 | 6 |
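The "best-of-20" qualifier on these metrics is not defined on this page. Assuming it means that 20 forecasts are sampled per input and the best-scoring one is reported, the following sketch shows how such a score could be computed for segmentation. The helper names `mean_iou` and `best_of_k_miou` are hypothetical, and the per-image IoU used here is a simplification: benchmark mIoU is usually accumulated over the whole dataset.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps, skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))


def best_of_k_miou(samples, target, num_classes):
    """Assumed reading of 'best-of-k': the highest score among k sampled forecasts."""
    return max(mean_iou(s, target, num_classes) for s in samples)


# Toy 2x2 ground-truth label map and three sampled forecasts (k = 3 for brevity).
target = np.array([[0, 0], [1, 1]])
samples = [
    np.array([[0, 1], [1, 1]]),  # one wrong pixel
    np.array([[0, 0], [1, 1]]),  # exact match
    np.array([[1, 1], [0, 0]]),  # labels swapped
]
score = best_of_k_miou(samples, target, num_classes=2)
print(score)
```

Under the same assumption, an error metric such as the RMSE reported for KITTI would take the minimum over the samples rather than the maximum.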