ViTs for SITS: Vision Transformers for Satellite Image Time Series
About
In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT). TSViT splits a SITS record into non-overlapping patches in space and time, which are tokenized and subsequently processed by a factorized temporo-spatial encoder. We argue that, in contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing, and present experimental evidence for this claim. Additionally, we enhance the model's discriminative power by introducing two novel mechanisms for acquisition-time-specific temporal positional encodings and multiple learnable class tokens. The effect of all novel design choices is evaluated through an extensive ablation study. Our proposed architecture achieves state-of-the-art performance, surpassing previous approaches by a significant margin on three publicly available SITS semantic segmentation and classification datasets. All model, training, and evaluation code is made publicly available to facilitate further research.
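The tokenization and temporal-then-spatial factorization described above can be illustrated with a minimal shape-level sketch. This is not the TSViT implementation; all sizes (`T`, `C`, `H`, `W`, the patch size) are hypothetical, and plain numpy reshapes stand in for the learned tokenizer and attention blocks, purely to show how the token grid is viewed by the temporal and spatial encoder stages.

```python
import numpy as np

# Hypothetical SITS record: T acquisitions, C channels, H x W pixels.
T, C, H, W = 6, 4, 24, 24
patch = 8                                    # illustrative spatial patch size
x = np.random.rand(T, C, H, W)

# 1) Split each frame into non-overlapping spatial patches and flatten
#    every patch into a token vector of size C * patch * patch.
nh, nw = H // patch, W // patch              # 3 x 3 patch grid
tokens = x.reshape(T, C, nh, patch, nw, patch)
tokens = tokens.transpose(2, 4, 0, 1, 3, 5)  # (nh, nw, T, C, patch, patch)
tokens = tokens.reshape(nh * nw, T, C * patch * patch)

# 2) Temporal-then-spatial factorization: the temporal encoder attends over
#    the T tokens at each patch location first...
temporal_view = tokens                       # (num_patches, T, token_dim)
# ...and the spatial encoder then attends over patch locations per time step.
spatial_view = tokens.transpose(1, 0, 2)     # (T, num_patches, token_dim)

print(temporal_view.shape)  # (9, 6, 256)
print(spatial_view.shape)   # (6, 9, 256)
```

Attending along the time axis first lets each patch location summarize its own temporal evolution (e.g. a crop's growth cycle) before any spatial mixing, which is the ordering the paper argues is better suited to SITS than the spatial-first factorizations used for natural video.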
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Temporal Crop Segmentation | PASTIS | Threshold 10% Score | 3.81 | 11 |
| Temporal Crop Segmentation | Germany | Performance Score @ 10% | 2.42 | 11 |
| Semantic segmentation | PASTIS 11 (test) | Score @ T=10% | 3.81 | 10 |
| Time Series Semantic Change Detection | DynamicEarthNet (test) | Imp. Surf. IoU | 22.28 | 10 |
| Semantic segmentation | Germany 32 (test) | Score @ 10% | 2.42 | 10 |
| Time Series Semantic Change Detection | MUDS | Not Build Indicator | 87.15 | 10 |
| Semantic segmentation | PASTIS official 5-Fold (test) | mIoU | 65.4 | 7 |
| Semantic segmentation | MTLCC (test) | mIoU | 84.8 | 7 |
| Crop Mapping | PASTIS-MM official five-fold (val) | OA | 83.4 | 7 |