ViTs for SITS: Vision Transformers for Satellite Image Time Series
About
In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT). TSViT splits a SITS record into non-overlapping patches in space and time, which are tokenized and subsequently processed by a factorized temporo-spatial encoder. We argue that, in contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing, and present experimental evidence for this claim. Additionally, we enhance the model's discriminative power by introducing two novel mechanisms for acquisition-time-specific temporal positional encodings and multiple learnable class tokens. The effect of all novel design choices is evaluated through an extensive ablation study. Our proposed architecture achieves state-of-the-art performance, surpassing previous approaches by a significant margin on three publicly available SITS semantic segmentation and classification datasets. All model, training, and evaluation code is made publicly available to facilitate further research.
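The tokenization and temporal-then-spatial factorization described above can be illustrated with a minimal shape-level sketch. This is not the TSViT implementation; all sizes (`T`, `C`, `H`, `W`, the patch size) are hypothetical, and plain numpy reshapes stand in for the learned tokenizer and attention blocks, purely to show how the token grid is viewed by the temporal and spatial encoder stages.

```python
import numpy as np

# Hypothetical SITS record: T acquisitions, C channels, H x W pixels.
T, C, H, W = 6, 4, 24, 24
patch = 8                                    # illustrative spatial patch size
x = np.random.rand(T, C, H, W)

# 1) Split each frame into non-overlapping spatial patches and flatten
#    every patch into a token vector of size C * patch * patch.
nh, nw = H // patch, W // patch              # 3 x 3 patch grid
tokens = x.reshape(T, C, nh, patch, nw, patch)
tokens = tokens.transpose(2, 4, 0, 1, 3, 5)  # (nh, nw, T, C, patch, patch)
tokens = tokens.reshape(nh * nw, T, C * patch * patch)

# 2) Temporal-then-spatial factorization: the temporal encoder attends over
#    the T tokens at each patch location first...
temporal_view = tokens                       # (num_patches, T, token_dim)
# ...and the spatial encoder then attends over patch locations per time step.
spatial_view = tokens.transpose(1, 0, 2)     # (T, num_patches, token_dim)

print(temporal_view.shape)  # (9, 6, 256)
print(spatial_view.shape)   # (6, 9, 256)
```

Attending along the time axis first lets each patch location summarize its own temporal evolution (e.g. a crop's growth cycle) before any spatial mixing, which is the ordering the paper argues is better suited to SITS than the spatial-first factorizations used for natural video.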
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Temporal Crop Segmentation | PASTIS | Threshold 10% Score | 3.81 | 11 |
| Temporal Crop Segmentation | Germany | Performance Score @ 10% | 2.42 | 11 |
| Semantic segmentation | PASTIS 11 (test) | Score @ T=10% | 3.81 | 10 |
| Time Series Semantic Change Detection | DynamicEarthNet (test) | Imp. Surf. IoU | 22.28 | 10 |
| Semantic segmentation | Germany 32 (test) | Score @ 10% | 2.42 | 10 |
| Time Series Semantic Change Detection | MUDS | Not Build Indicator | 87.15 | 10 |
| Semantic segmentation | PASTIS official 5-Fold (test) | mIoU | 65.4 | 7 |
| Semantic segmentation | MTLCC (test) | mIoU | 84.8 | 7 |
| Crop Mapping | PASTIS-MM official five-fold (val) | OA | 83.4 | 7 |