SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning
About
Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.
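To make the two architectural ideas above more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not SkyNative's implementation: the abstract does not specify the patch size, the patch projection, or which parameters are modality-specific. The names `patchify`, `linear`, and `modality_aware_linear` are hypothetical, and per-token routing between two weight matrices is only one plausible reading of "modality-specific parameters within a unified autoregressive backbone".

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]               # stored as a list of output rows
Image = List[List[Sequence[float]]]      # H x W x C nested lists


def patchify(image: Image, patch: int) -> List[Vector]:
    """Split an H x W x C image into non-overlapping patch x patch tiles and
    flatten each tile (row-major) into one vector of length patch * patch * C.
    No pretrained encoder is involved: the tokens are raw pixel values."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens: List[Vector] = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat: Vector = []
            for row in image[top:top + patch]:
                for pixel in row[left:left + patch]:
                    flat.extend(pixel)
            tokens.append(flat)
    return tokens


def linear(vec: Vector, weight: Matrix) -> Vector:
    """Multiply a vector by a weight matrix given as a list of output rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def modality_aware_linear(hidden: List[Vector], is_image: List[bool],
                          w_text: Matrix, w_image: Matrix) -> List[Vector]:
    """Assumed routing scheme: every token stays in one shared sequence, but
    image tokens and text tokens are transformed by separate weights."""
    return [linear(h, w_image if img else w_text)
            for h, img in zip(hidden, is_image)]


if __name__ == "__main__":
    # Toy 4 x 4 single-channel image whose pixel value equals its flat index.
    image = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
    print(patchify(image, patch=2))

    # One text token followed by one image token, both in a 2-d hidden space.
    hidden = [[1.0, 2.0], [3.0, 4.0]]
    w_text = [[1.0, 0.0], [0.0, 1.0]]    # identity, for illustration only
    w_image = [[2.0, 0.0], [0.0, 2.0]]   # doubles its input, illustration only
    print(modality_aware_linear(hidden, [False, True], w_text, w_image))
```

In a real model the raw patch vectors would first be projected to the hidden width of the language model before any shared layers; that projection is omitted here because the abstract does not describe it.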
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | WHU-RS19 | Accuracy | 90 | 70 |
| Image Classification | AID | Accuracy | 84.83 | 66 |
| Visual Question Answering | RSVQA-HR | -- | -- | 29 |
| Visual Question Answering | RSME-Bench | Causal Decay Rate (Δc) | 5.3 | 20 |
| Visual Question Answering | CHOICE | Causal Decay Rate (Δc) | 15.22 | 20 |
| Remote Sensing Reasoning | XLRS-Bench | -- | -- | 18 |
| Object Perception | DOTA (val) | Accuracy | 35.98 | 10 |
| Object Perception | HRRSD | Accuracy | 68.93 | 10 |
| Object Perception | VisDrone | Accuracy | 19 | 10 |
| Remote Sensing Evaluation | MME RW-RS | MME-RW-RS Score | 43.63 | 10 |