
Uni-RS: A Spatially Faithful Unified Understanding and Generation Model for Remote Sensing

About

Unified remote sensing multimodal models exhibit a pronounced spatial reversal curse: although they can accurately recognize and describe object locations in images, they often fail to faithfully execute the same spatial relations during text-to-image generation, where such relations constitute core semantic information in remote sensing. Motivated by this observation, we propose Uni-RS, the first unified multimodal model tailored for remote sensing, to explicitly address the spatial asymmetry between understanding and generation. Specifically, we first introduce explicit Spatial-Layout Planning to transform textual instructions into spatial layout plans, decoupling geometric planning from visual synthesis. We then impose Spatial-Aware Query Supervision to bias learnable queries toward spatial relations explicitly specified in the instruction. Finally, we develop Image-Caption Spatial Layout Variation to expose the model to systematic, geometry-consistent spatial transformations. Extensive experiments across multiple benchmarks show that our approach substantially improves spatial faithfulness in text-to-image generation, while maintaining strong performance on multimodal understanding tasks such as image captioning, visual grounding, and VQA.
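To make the Spatial-Layout Planning idea concrete, here is a minimal toy sketch of the decoupling the abstract describes: a textual spatial instruction is first translated into an explicit layout plan (object names mapped to bounding boxes), and only then would visual synthesis run. All names, the relation vocabulary, and the box placements are illustrative assumptions, not the paper's actual interface.

```python
# Toy sketch of Spatial-Layout Planning: turn a spatial instruction into
# an explicit layout plan before any pixel synthesis. The function name,
# relation set, and box sizes are hypothetical, for illustration only.

from dataclasses import dataclass


@dataclass
class Box:
    x: float  # left edge, normalized to [0, 1]
    y: float  # top edge, normalized to [0, 1]
    w: float  # width
    h: float  # height


# Offsets applied to a default centered box for each supported relation.
RELATIONS = {
    "left of":  (-0.3, 0.0),
    "right of": (0.3, 0.0),
    "above":    (0.0, -0.3),
    "below":    (0.0, 0.3),
}


def plan_layout(subject: str, relation: str, anchor: str) -> dict:
    """Place the anchor object at the image center and offset the
    subject object according to the stated spatial relation."""
    dx, dy = RELATIONS[relation]
    anchor_box = Box(0.4, 0.4, 0.2, 0.2)            # centered anchor
    subject_box = Box(0.4 + dx, 0.4 + dy, 0.2, 0.2)  # shifted subject
    return {anchor: anchor_box, subject: subject_box}


layout = plan_layout("plane", "left of", "storage tank")
print(layout["plane"].x < layout["storage tank"].x)  # → True
```

In this sketch, the geometric decision (where each object goes) is made symbolically and can be checked for faithfulness before any image is rendered, which is the separation the abstract argues for.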

Weiyu Zhang, Yuan Hu, Yong Li, Yu Liu • 2026

Related benchmarks

Task                      Dataset         Metric               Result  Rank
Text-to-Image Generation  RSICD           FID                  22.11   13
VQA                       RSIEval         Average Score        65.43   5
Captioning                RSIEval         BLEU-4               16.8    5
Grounding                 VRSBench (val)  Accuracy @ IoU 0.5   45.7    5
Captioning                VRSBench (val)  BLEU-4               12.2    5
VQA                       VRSBench (val)  Accuracy             74      5
Text-to-Image Generation  RSIEval         FID                  13.04   3
Text-to-Image Generation  VRSBench        FID                  4.51    3
