SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

About

Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.
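The two architectural ideas in the abstract, raw patch tokens and modality-specific parameters inside one shared backbone, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names (`patchify`, `modality_routed_linear`), the nested-list image layout, and the use of a single routed linear layer are all assumptions made for illustration; the abstract does not specify patch size, layer types, or which parameters are decoupled.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]          # list of output rows
Image = List[List[List[float]]]     # H x W x C nested lists (hypothetical layout)


def patchify(image: Image, patch: int) -> List[Vector]:
    """Split an H x W x C image into non-overlapping patch x patch tiles.

    Each tile is flattened (row-major, channels last) into one raw token of
    length patch * patch * C, with no pretrained encoder in between.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ])
    return tokens


def linear(vec: Sequence[float], weight: Matrix) -> Vector:
    """Apply a weight matrix to a vector (one dot product per output row)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def modality_routed_linear(
    hidden: List[Vector],
    is_visual: Sequence[bool],
    w_visual: Matrix,
    w_text: Matrix,
) -> List[Vector]:
    """Assumed form of modality-aware decoupling: every token stays in one
    shared sequence, but each is transformed by its own modality's weights."""
    return [
        linear(h, w_visual if visual else w_text)
        for h, visual in zip(hidden, is_visual)
    ]
```

For example, a 4 x 4 single-channel image with a patch size of 2 yields four raw tokens of length 4. In a mixed sequence, `is_visual` marks which positions came from `patchify` and which are text embeddings, so the sequence order and attention can stay shared while the per-token weights differ.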

Xiao Yang, Ronghao Fu, Zhiwen Lin, Zhuoran Duan, Jiashun Zhu, Jiasen Hu, Lang Sun, Weipeng Zhang, Jiaqi Liu, Xu Na, Haoran Liu, Weijie Zhang, Bo Yang • 2026

Related benchmarks

Task	Dataset	Result
Image Classification	WHU-RS19	Accuracy 90	104
Image Classification	AID	Accuracy 84.83	83
Visual Question Answering	RSVQA-HR	--	38
Visual Question Answering	RSME-Bench	Causal Decay Rate (Δc) 5.3	20
Visual Question Answering	CHOICE	Causal Decay Rate (Δc) 15.22	20
Remote Sensing Reasoning	XLRS-Bench	--	18
Object Perception	DOTA (val)	Accuracy 35.98	10
Object Perception	HRRSD	Accuracy 68.93	10
Object Perception	VisDrone	Accuracy 19	10
Remote Sensing Evaluation	MME RW-RS	MME-RW-RS Score 43.63	10

Showing 10 of 16 rows
