TerraMind: Large-Scale Generative Multimodality for Earth Observation
About
We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations that combine token-level and pixel-level data across modalities. On the token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on the pixel level it leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities from a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early-fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM), the capability to generate additional artificial data during finetuning and inference to improve the model output, and (iii) TerraMind achieves beyond state-of-the-art performance on community-standard EO benchmarks such as PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.
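The TiM idea from (ii) can be pictured with a small, self-contained PyTorch sketch: before the final prediction, the model first generates an auxiliary modality from the input and then conditions the prediction head on both streams. Everything below is illustrative; `TiMSegmenter`, `tim_generator`, and the channel counts are placeholder assumptions for this sketch, not TerraMind's actual modules or API.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of "Thinking-in-Modalities" (TiM): generate an
# intermediate modality (e.g., pseudo land-cover logits from a
# Sentinel-2 patch) and fuse it with the input features before the
# final segmentation head. All names are placeholders for this sketch.

class TiMSegmenter(nn.Module):
    def __init__(self, in_ch=12, gen_ch=10, num_classes=2):
        super().__init__()
        # Stand-in for the pretrained multimodal encoder.
        self.encoder = nn.Conv2d(in_ch, 64, kernel_size=3, padding=1)
        # Stand-in generator that "thinks" the auxiliary modality.
        self.tim_generator = nn.Conv2d(64, gen_ch, kernel_size=1)
        # The head sees the encoded input *and* the generated modality.
        self.head = nn.Conv2d(64 + gen_ch, num_classes, kernel_size=1)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))
        tim = self.tim_generator(feats)          # generated modality
        fused = torch.cat([feats, tim], dim=1)   # fuse both streams
        return self.head(fused)

model = TiMSegmenter()
s2 = torch.randn(1, 12, 64, 64)  # dummy 12-band Sentinel-2 patch
print(model(s2).shape)           # torch.Size([1, 2, 64, 64])
```

The design point this sketch tries to capture is that the intermediate modality is produced by the model itself, so no extra labels or sensors are needed at finetuning or inference time.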
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic segmentation | SN-7-TS (test) | mIoU | 60.61 | 24 |
| Semantic segmentation | MADOS (test) | mIoU | 0.6952 | 19 |
| Semantic segmentation | Pangaea Aggregate (test) | Average Rank | 3.56 | 19 |
| Semantic segmentation | PASTIS (test) | mIoU | 40.51 | 19 |
| Semantic segmentation | Sen1Floods11 (test) | mIoU | 90.62 | 19 |
| Semantic segmentation | AI4Farms (test) | mIoU | 28.12 | 19 |
| Semantic segmentation | HLS Burns (test) | mIoU | 82.42 | 19 |
| Semantic segmentation | DynEarthNet (test) | mIoU | 37.87 | 19 |
| Semantic segmentation | CropMap (test) | mIoU | 55.8 | 19 |
| Semantic segmentation | Pangaea 10% (train) | HLS Burns | 77.39 | 19 |