EarthMind: Leveraging Cross-Sensor Data for Advanced Earth Observation Interpretation with a Unified Multimodal LLM

About

Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs, overlooking the complementarity across heterogeneous modalities. We propose EarthMind, a unified vision-language framework that handles both single- and cross-sensor inputs via a hierarchical cross-modal attention (HCA) design. Specifically, HCA hierarchically captures visual relationships across sensors and aligns them with language queries, enabling adaptive fusion of optical and Synthetic Aperture Radar (SAR) features. To support cross-sensor learning, we curate FusionEO, a 30K-pair dataset with diverse annotations, and establish EarthMind-Bench, a 2,841-pair benchmark with expert annotations for perception and reasoning tasks. Extensive experiments show that EarthMind achieves state-of-the-art results on EarthMind-Bench and surpasses existing MLLMs on multiple EO benchmarks.
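
To make the idea of query-conditioned cross-sensor fusion more concrete, here is a minimal sketch in plain Python. It is not the EarthMind implementation: the abstract does not specify HCA's layers, so the two-level structure below (cross-sensor attention first, then a language-query gate over the two sensors), the function names `attend` and `fuse`, and the use of plain scaled dot-product attention without learned projections are all assumptions made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, tokens):
    # Scaled dot-product attention of one query vector over token vectors.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    dim = len(tokens[0])
    return [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]

def fuse(query, optical_tokens, sar_tokens):
    # Hypothetical level 1: each token attends over the other sensor's tokens,
    # and the attended context is added back as a residual.
    opt_ctx = [[o + c for o, c in zip(tok, attend(tok, sar_tokens))]
               for tok in optical_tokens]
    sar_ctx = [[s + c for s, c in zip(tok, attend(tok, optical_tokens))]
               for tok in sar_tokens]
    # Hypothetical level 2: the language query pools each sensor, then a
    # softmax gate decides how much each sensor contributes to the output.
    opt_summary = attend(query, opt_ctx)
    sar_summary = attend(query, sar_ctx)
    scale = math.sqrt(len(query))
    gate = softmax([dot(query, opt_summary) / scale,
                    dot(query, sar_summary) / scale])
    fused = [gate[0] * o + gate[1] * s
             for o, s in zip(opt_summary, sar_summary)]
    return fused, gate

# Toy example: a 2-dimensional query, two optical tokens, one SAR token.
fused, gate = fuse([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]])
print(fused, gate)
```

The gate weights sum to one, so the fused vector is a convex combination of the two sensor summaries. A real model would add learned projections, multiple heads, and many more tokens per image.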

Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begüm Demir, Nicu Sebe, Paolo Rota • 2025

Related benchmarks

Task	Dataset	Result
Visual Grounding	VRS Bench (test)	mIoU 59.61	16
Visual Grounding	NWPU VHR-10 (test)	mIoU 52.56	16
Visual Grounding	VRSBench	mIoU 59.61	15
Prognosis Prediction	Multi-center clinical dataset	AUC 79.34	12
Multiple Choice Question (MCQ)	EarthMind-Bench SAR	Scene Classification Accuracy 64.4	6
Open-Ended VQA (OE)	EarthMind-Bench SAR	Image Captioning Score 3.1	6
Multiple Choice Question (MCQ)	EarthMind-Bench Optical	Scene Classification Accuracy 64.3	6
Open-Ended VQA (OE)	EarthMind-Bench Optical	Image Captioning Score 3.35	6
Visual Grounding	NWPU VHR-10	mIoU 52.56	5
Classification	BigEarthNet MS	Recall 71.2	4

Showing 10 of 13 rows
