
Uni-RS: A Spatially Faithful Unified Understanding and Generation Model for Remote Sensing

About

Unified remote sensing multimodal models exhibit a pronounced spatial reversal curse: although they can accurately recognize and describe object locations in images, they often fail to faithfully execute the same spatial relations during text-to-image generation, where such relations constitute core semantic information in remote sensing. Motivated by this observation, we propose Uni-RS, the first unified multimodal model tailored for remote sensing, to explicitly address the spatial asymmetry between understanding and generation. Specifically, we first introduce explicit Spatial-Layout Planning to transform textual instructions into spatial layout plans, decoupling geometric planning from visual synthesis. We then impose Spatial-Aware Query Supervision to bias learnable queries toward spatial relations explicitly specified in the instruction. Finally, we develop Image-Caption Spatial Layout Variation to expose the model to systematic, geometry-consistent spatial transformations. Extensive experiments across multiple benchmarks show that our approach substantially improves spatial faithfulness in text-to-image generation, while maintaining strong performance on multimodal understanding tasks such as image captioning, visual grounding, and VQA.
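To make the Spatial-Layout Planning idea concrete, here is a minimal toy sketch of the decoupling the abstract describes: a textual spatial instruction is first translated into an explicit layout plan (object names mapped to bounding boxes), and only then would visual synthesis run. All names, the relation vocabulary, and the box placements are illustrative assumptions, not the paper's actual interface.

```python
# Toy sketch of Spatial-Layout Planning: turn a spatial instruction into
# an explicit layout plan before any pixel synthesis. The function name,
# relation set, and box sizes are hypothetical, for illustration only.

from dataclasses import dataclass


@dataclass
class Box:
    x: float  # left edge, normalized to [0, 1]
    y: float  # top edge, normalized to [0, 1]
    w: float  # width
    h: float  # height


# Offsets applied to a default centered box for each supported relation.
RELATIONS = {
    "left of":  (-0.3, 0.0),
    "right of": (0.3, 0.0),
    "above":    (0.0, -0.3),
    "below":    (0.0, 0.3),
}


def plan_layout(subject: str, relation: str, anchor: str) -> dict:
    """Place the anchor object at the image center and offset the
    subject object according to the stated spatial relation."""
    dx, dy = RELATIONS[relation]
    anchor_box = Box(0.4, 0.4, 0.2, 0.2)            # centered anchor
    subject_box = Box(0.4 + dx, 0.4 + dy, 0.2, 0.2)  # shifted subject
    return {anchor: anchor_box, subject: subject_box}


layout = plan_layout("plane", "left of", "storage tank")
print(layout["plane"].x < layout["storage tank"].x)  # → True
```

In this sketch, the geometric decision (where each object goes) is made symbolically and can be checked for faithfulness before any image is rendered, which is the separation the abstract argues for.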

Weiyu Zhang, Yuan Hu, Yong Li, Yu Liu • 2026

Related benchmarks

Task                      Dataset         Metric               Result  Rank
Text-to-Image Generation  RSICD           FID                  22.11   13
VQA                       RSIEval         Average Score        65.43   5
Captioning                RSIEval         BLEU-4               16.8    5
Grounding                 VRSBench (val)  Accuracy @ IoU 0.5   45.7    5
Captioning                VRSBench (val)  BLEU-4               12.2    5
VQA                       VRSBench (val)  Accuracy             74      5
Text-to-Image Generation  RSIEval         FID                  13.04   3
Text-to-Image Generation  VRSBench        FID                  4.51    3
