EarthMind: Leveraging Cross-Sensor Data for Advanced Earth Observation Interpretation with a Unified Multimodal LLM

About

Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs, overlooking the complementarity across heterogeneous modalities. We propose EarthMind, a unified vision-language framework that handles both single- and cross-sensor inputs via a hierarchical cross-modal attention (HCA) design. Specifically, HCA hierarchically captures visual relationships across sensors and aligns them with language queries, enabling adaptive fusion of optical and Synthetic Aperture Radar (SAR) features. To support cross-sensor learning, we curate FusionEO, a 30K-pair dataset with diverse annotations, and establish EarthMind-Bench, a 2,841-pair benchmark with expert annotations for perception and reasoning tasks. Extensive experiments show that EarthMind achieves state-of-the-art results on EarthMind-Bench and surpasses existing MLLMs on multiple EO benchmarks.
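As a rough illustration of what query-conditioned fusion of optical and SAR features can look like, the toy sketch below pools each sensor's tokens with respect to a language query and then weights the two sensor summaries by their relevance to that query. It is not the HCA design from the paper, whose details the abstract does not give; the function names (`attend`, `fuse`), the two-stage structure, and the use of plain scaled dot-product attention are all assumptions made for illustration.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, tokens):
    """Scaled dot-product attention of one query vector over token vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    dim = len(tokens[0])
    return [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]


def fuse(query, optical_tokens, sar_tokens):
    """Hypothetical two-stage fusion (an assumption, not the paper's method)."""
    # Stage 1: summarize each sensor's tokens with respect to the query.
    optical_summary = attend(query, optical_tokens)
    sar_summary = attend(query, sar_tokens)
    # Stage 2: gate the two summaries by their relevance to the query.
    scale = math.sqrt(len(query))
    gate = softmax([
        dot(query, optical_summary) / scale,
        dot(query, sar_summary) / scale,
    ])
    fused = [gate[0] * o + gate[1] * s
             for o, s in zip(optical_summary, sar_summary)]
    return fused, gate


if __name__ == "__main__":
    query = [1.0, 0.0]
    optical = [[1.0, 0.0], [0.0, 1.0]]
    sar = [[0.0, 1.0], [0.0, 1.0]]
    fused, gate = fuse(query, optical, sar)
    print("gate (optical, SAR):", gate)
    print("fused feature:", fused)
```

In this toy setup the gate sums to 1, so a query that matches the optical tokens more closely shifts weight toward the optical summary, which is one simple reading of "adaptive fusion".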

Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begüm Demir, Nicu Sebe, Paolo Rota • 2025

Related benchmarks

Task                  Dataset                        Metric  Result  Rank
Visual Grounding      VRS Bench (test)               mIoU    59.61   16
Visual Grounding      NWPU VHR-10 (test)             mIoU    52.56   16
Visual Grounding      VRSBench                       mIoU    59.61   15
Prognosis Prediction  Multi-center clinical dataset  AUC     79.34   12
Visual Grounding      NWPU VHR-10                    mIoU    52.56   5
