# TerraMind: Large-Scale Generative Multimodality for Earth Observation

## About
We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On the token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on the pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities from a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early-fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance on community-standard EO benchmarks such as PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.
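The "Thinking-in-Modalities" idea described above can be sketched as a simple inference pattern: before producing its final output, the model first generates an intermediate artificial modality (e.g. a coarse land-cover layer from optical input) and feeds it back in as additional context. The sketch below is a toy illustration of that pattern only; `generate_modality`, `predict`, and `tim_inference` are hypothetical stand-ins, not the TerraMind API.

```python
# Toy sketch of the Thinking-in-Modalities (TiM) pattern.
# All functions here are hypothetical stand-ins for illustration.

def generate_modality(optical_patch):
    """Hypothetical generator: derive a coarse binary land-cover layer
    from optical values. A simple threshold stands in for the generative step."""
    return [1 if v > 0.5 else 0 for v in optical_patch]

def predict(optical_patch, extra_modalities):
    """Hypothetical downstream head: fuse the original input with the
    generated modalities. Averaging stands in for learned fusion."""
    fused = list(optical_patch)
    for modality in extra_modalities:
        fused = [(f + m) / 2 for f, m in zip(fused, modality)]
    return fused

def tim_inference(optical_patch):
    # Step 1: "think" by generating an intermediate artificial modality.
    land_cover = generate_modality(optical_patch)
    # Step 2: predict using the original input plus the generated modality.
    return predict(optical_patch, [land_cover])
```

The point of the pattern is that the intermediate modality is produced by the model itself at finetuning or inference time, so no extra labeled data is required.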
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | Sen1Floods11 | mIoU (macro) | 88.42 | 29 |
| Semantic segmentation | MADOS | mIoU | 67.44 | 26 |
| Pixel-wise classification | Dominant Leaf Type Area of interest A+ | IoU | 80 | 26 |
| Semantic segmentation | HLS Burn Scars | mIoU | 82.93 | 25 |
| Semantic segmentation | PASTIS | Macro mIoU | 41.53 | 24 |
| Semantic segmentation | SN-7-TS (test) | mIoU | 60.61 | 24 |
| Semantic segmentation | MADOS (test) | mIoU | 0.6952 | 19 |
| Semantic segmentation | Pangaea Aggregate (test) | Average Rank | 3.56 | 19 |
| Semantic segmentation | PASTIS (test) | mIoU | 40.51 | 19 |
| Semantic segmentation | DynamicEarthNet (DEN) | mIoU | 38.46 | 19 |