UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines
About
Traditional spatiotemporal models generally rely on task-specific architectures, whose domain-specific design requirements limit their generalizability and scalability across diverse tasks. In this paper, we introduce UniSTD, a unified Transformer-based framework for spatiotemporal modeling, inspired by recent foundation models and their two-stage pretraining-then-adaptation paradigm. Specifically, our work demonstrates that task-agnostic pretraining on 2D vision and vision-text datasets can build a generalizable model foundation for spatiotemporal learning, followed by specialized joint training on spatiotemporal datasets to enhance task-specific adaptability. To improve learning capabilities across domains, our framework employs a rank-adaptive mixture-of-experts adaptation that uses fractional interpolation to relax the discrete variables so that they can be optimized in continuous space. Additionally, we introduce a temporal module to incorporate temporal dynamics explicitly. We evaluate our approach on a large-scale dataset covering 10 tasks across 4 disciplines, demonstrating that a unified spatiotemporal model can achieve scalable, cross-task learning and support up to 10 tasks simultaneously within one model while reducing training costs in multi-domain applications. Code will be available at https://github.com/1hunters/UniSTD.
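The abstract does not spell out how fractional interpolation relaxes a discrete rank, so the following is only a minimal sketch of one common reading, not the authors' implementation. It assumes a LoRA-style low-rank update built from two factor matrices and treats a continuous rank `r` as a blend of the updates at the two nearest integer ranks. The names `low_rank_update` and `fractional_rank_update`, and the toy matrices, are hypothetical.

```python
import math

def low_rank_update(A, B, rank):
    """Compute B[:, :rank] @ A[:rank, :] on nested lists (hypothetical helper)."""
    d_out, d_in = len(B), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(rank)) for j in range(d_in)]
            for i in range(d_out)]

def fractional_rank_update(A, B, r):
    """Blend the updates at floor(r) and floor(r) + 1 (assumed relaxation).

    The output is piecewise linear in r, so r can in principle be
    optimized as a continuous variable instead of a discrete choice.
    """
    lo = math.floor(r)
    hi = min(lo + 1, len(A))  # never exceed the maximum available rank
    frac = r - lo
    low = low_rank_update(A, B, lo)
    high = low_rank_update(A, B, hi)
    return [[(1 - frac) * l + frac * h for l, h in zip(row_l, row_h)]
            for row_l, row_h in zip(low, high)]

# Toy factors: A is (max_rank x d_in), B is (d_out x max_rank).
A = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0]]
B = [[1.0, 2.0, 0.0], [0.0, 1.0, 4.0]]

update = fractional_rank_update(A, B, 1.5)
print(update)
```

In this sketch, an integer `r` recovers the ordinary rank-`r` update, while a fractional `r` adds a scaled share of the next rank-one component. How UniSTD couples this with expert routing is not described in the abstract and is left out here.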
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Weather Forecasting | SEVIR | -- | 20 |
| Traffic Flow Prediction | TaxiBJ | RMSE 0.54 | 13 |
| Video Prediction | MMNIST | -- | 12 |
| Traffic Control | TaxiBJ | PSNR 39.6 | 8 |
| Trajectory Prediction and Robot Action Planning | Human3.6M | PSNR 33.2 | 8 |
| Driving Scene Prediction | Cityscapes | PSNR 27.4 | 7 |
| Trajectory Prediction and Robot Action Planning | Bair | PSNR 20.3 | 7 |
| Driving Scene Prediction | ETH | PSNR 28.4 | 5 |
| Driving Scene Prediction | KITTI | PSNR 17.2 | 5 |
| Traffic Control | Traffic4Cast | PSNR 30.6 | 5 |