DINO-Foresight: Looking into the Future with DINO
About
Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and often focus on irrelevant details. To address these challenges, we introduce DINO-Foresight, a novel framework that operates in the semantic feature space of pretrained Vision Foundation Models (VFMs). Our approach trains a masked feature transformer in a self-supervised manner to predict the evolution of VFM features over time. By forecasting these features, we can apply off-the-shelf, task-specific heads for various scene understanding tasks. In this framework, VFM features are treated as a latent space, to which different heads attach to perform specific tasks for future-frame analysis. Extensive experiments demonstrate the strong performance, robustness, and scalability of our framework. Project page and code: https://dino-foresight.github.io/
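The pipeline described above (forecast features for a masked future frame, then reuse a task head on the forecast) can be illustrated with a toy sketch. This is not the authors' implementation: `MASK_TOKEN`, `build_masked_input`, `toy_predictor`, `masked_l1_loss`, and `toy_head` are hypothetical names, the linear extrapolation merely stands in for the masked feature transformer, the L1-style loss is an assumption, and the argmax "head" stands in for a real task-specific head.

```python
from typing import List

Feature = List[float]   # feature vector of one patch token
Frame = List[Feature]   # all patch tokens of one frame

# Hypothetical stand-in for a learned [MASK] embedding.
MASK_TOKEN: Feature = [0.0, 0.0]


def build_masked_input(context: List[Frame], tokens_per_frame: int) -> List[Frame]:
    """Append a fully masked future frame to the observed context frames."""
    masked_future = [list(MASK_TOKEN) for _ in range(tokens_per_frame)]
    return context + [masked_future]


def toy_predictor(frames: List[Frame]) -> Frame:
    """Stand-in for the masked feature transformer (assumption).

    Fills the masked last frame by linearly extrapolating each token
    from the two most recent context frames.
    """
    prev, last = frames[-3], frames[-2]
    return [
        [2.0 * l - p for p, l in zip(tok_prev, tok_last)]
        for tok_prev, tok_last in zip(prev, last)
    ]


def masked_l1_loss(pred: Frame, target: Frame) -> float:
    """Mean absolute error over the masked (future) tokens; assumed loss."""
    diffs = [abs(p - t) for tp, tt in zip(pred, target) for p, t in zip(tp, tt)]
    return sum(diffs) / len(diffs)


def toy_head(frame: Frame) -> List[int]:
    """Stand-in for a task head: per-token argmax over feature channels."""
    return [max(range(len(tok)), key=tok.__getitem__) for tok in frame]


# Two context frames with two tokens each; the third frame is the target.
context = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.2, 1.0], [1.0, 0.1]],
]
future_target = [[0.4, 1.0], [1.0, 0.2]]

inputs = build_masked_input(context, tokens_per_frame=2)
predicted = toy_predictor(inputs)               # forecast future features
loss = masked_l1_loss(predicted, future_target)  # self-supervised target
labels = toy_head(predicted)                    # task head on the forecast
```

The point of the sketch is the separation of concerns: the predictor only ever sees and produces features, so the same forecast can be passed to several heads without retraining the predictor.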
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Future Semantic Segmentation | Cityscapes (test/val) | mIoU (All Classes) | 44.75 | 12 |
| Future Depth Estimation | Kubric (test) | d1 Score | 69.31 | 12 |
| Future Semantic Segmentation | Kubric (test) | mIoU (All Classes) | 57.62 | 12 |
| Future Surface Normals Estimation | Cityscapes (test/val) | a3 | 89.87 | 12 |
| Future Depth Estimation | Cityscapes (test/val) | d1 Score | 77.66 | 12 |
| Future Surface Normals Estimation | Kubric (test) | a3 | 90.62 | 12 |